基本情况 主编致辞 收录获奖
 编委会 编辑部 审稿专家
 本刊学术规范 行业规范

doi: 10.11925/infotech.2096-3467.2017.0723
Is Big Data Analytics Beyond the Reach of Small Companies?

Abstract:

Big data analytics is often prohibitively costly. It is typically conducted by parallel processing with a cluster of machines, and is considered a privilege of big companies that can afford the resources. This position paper argues that big data analytics is accessible to small companies with constrained resources. As an evidence, we present BEAS, a framework for querying big relations with constrained resources, based on bounded evaluation and data-driven approximation.

Key words: Big data analytics ; Bounded evaluation ; Data-driven approximation ; Constrained resources

1 Introduction

Big data analytics is expensive. Textbooks tell us that a computation problem is tractable if there exists a polynomial-time algorithm for it, i.e., its cost can be expressed as a polynomial in the size n of the input[1]. However, it is no longer the case when it comes to big data. When n is big, algorithms in O(n2) or even O(n) time may take too long to be practical. Indeed, assuming the largest Solid State Drives (SSD) with 12GB/s for read[2], a linear scan of a dataset of 15TB takes more than 20 minutes. It easily takes hours to join tables with millions of tuples[3]. In other words, many computation problems that are typically considered tractable may become infeasible in the context of big data.

One might be tempted to think that parallel computation could solve the problem, by adding more processors when needed. However, small businesses often have constrained resources and cannot afford large-scale parallel computation.

Is big data analytics a privilege of big companies? Is it beyond the reach of small companies that can only afford constrained resources?

We argue that big data analytics is possible under constrained resources. As an example, we consider relational query answering. Given an SQL query Q and a relational database D, it is to compute the answers Q(D) to Q in D. Relational data accounts for the majority of data in industry. Moreover, it is nontrivial to compute Q(D). Indeed, it is NP-complete to decide whether a given tuple t is in Q(D) when Q is a simple SPC query (selection, projection and Cartesian product), and is PSPACE-complete when Q is in relational algebra, both subsumed by SQL.

We propose BEAS, a new query evaluation paradigm to answer SQL queries under constrained resources. The idea is to make big data small, i.e., to reduce queries on big data to computation on small data. Underlying BEAS are two principled approaches: (1) bounded evaluation that computes exact answers by accessing a bounded amount of data when possible[4-8], and (2) a data-driven approximation scheme that answers queries for which exact answers are beyond reach under bounded resources, and offers a deterministic accuracy bound[9].

As proof of concept, we have developed a prototype system[10] and evaluated it with call-detailed-record (CDR) queries at Huawei. We find that bounded evaluation improves the performance of CDR queries by orders of magnitude[11], and is able to reduce big datasets from PB (1015) to GB (109) in many cases.

Below we give a brief introduction to bounded evaluation (Section 2), followed by data-driven approximation scheme (Section 3). Putting these together, we present BEAS, our resource-bounded query evaluation framework (Section 4).

2 Bounded Evaluation

Given a query Q posed on a big dataset D, bounded evaluation[7] aims to compute Q(D) by accessing only a bounded subset DQ of D that includes necessary information for answering Q in D, instead of the entire D. To identify DQ, it makes use of an access schema A, which is a set of access constraints, i.e., a combination of simple cardinality constraints and their associated indices.

Under access schema A, query Q is boundedly evaluable if for all datasets D that conform to A, there exists a fraction DQD such that

Q(DQ) = Q(D), i.e., DQ suffices for computing exact answers Q(D); and

◦the time for identifying DQ and hence the size |DQ| of DQ are determined by Q and A only. That is, the cost of computing Q(DQ) is independent of |D|.

Intuitively, Q(D) can be computed by accessing DQ. We identify DQ by reasoning about the cardinality constraints in A, and fetch it by using the indices in A.

Example 1: Consider a database schema R0 consisting of three relations:

(a) person(pid, city), stating that pid lives in city,

(b) friend(pid, fid), saying that fid is a friend of pid, and

(c) poi(address, type, city, price), for the type, price and city of points of interest.

An example access schema A consists of the following two access constraints:

φ1: friend(pid → fid, 5000),

φ2: person(pid → city, 1).

Here φ1 is a constraint imposed by Facebook[12]: a limit of 5000 friends per person; and φ2 states that each person lives in at most one city. An index is built for φ1 such that given a pid, it returns all fids of pid from friend, i.e., φ1 includes the cardinality constraint (5000 fids for each pid) and the index; similarly for φ2.

Consider a query Q1 to find the cities where my friends live, which is taken from Graph Search of Facebook[13]. Written in SQL, Q1 can be expressed as:

select p.city

from friend as f, person as p

where f.pid = p0 and f.fid = p.pid,

where p0 indicates “me”. When an instance D0 of R0 is “big”, e.g., Facebook has billions of users and trillions of friend links[12], it is costly to compute Q1(D0).

However, we can do better since Q1 is boundedly evaluable under A: (a) we first identify and fetch at most 5000 fids for person p0 from relation friend by using the index for φ1, and (b) for each fid fetched, we get her city by fetching 1 tuple from relation person via the index for φ2. In total we fetch a set DQ of 10,000 tuples, instead of trillions; it suffices to compute Q1(D0) by using DQ only[4].

The notion of access schema was proposed in[8], and the foundation of bounded evaluation was established in[5,7]. The theory of bounded evaluation was first evaluated using SPC queries[6] and then extended to relational algebra(RA)[4]. A challenge is that it is undecidable to decide whether an SQL query is boundedly evaluable under an access schema A [7]. To cope with this, an effective syntax L was developed for boundedly evaluable RA queries[4]. That is, L is a class of RA queries such that under A,

(a) an RA query Q is boundedly evaluable if and only if it is equivalent to a query Q' in L; and

(b) it takes PTIME (polynomial time) in the size |Q| of Q and size |A| of A to check whether Q' is in L, reducing the problem to syntactic checking.

That is, L identifies the core subclass of boundedly evaluable RA queries, without sacrificing their expressive power. This is analogous to the study of safe relational calculus queries, which is also undecidable[14]. Based on the effective syntax, we can efficiently check whether a query Q is boundedly evaluable, and if so, generate a query plan to answer Q by accessing a bounded fraction DQ of D[4].

It has been shown that bounded evaluation can be readily built on top of commercial DBMS (database management systems) such as MySQL and PostgreSQL. It extends the DBMS with an immediate capability of querying big relations under constrained resources[4]. A prototype system was developed in[10]. We find that about 77% of SPC queries[6] and 67% of SQL queries[4] are boundedly evaluable. Better yet, more than 90% of the CDR queries at Huawei are boundedly evaluable.

3 Data Driven Approximation

For queries Q that are not boundedly evaluable, can we evaluate Q against a big dataset D under constrained resources? We answer the question in the affirmative.

We propose a data-driven scheme for approximate query answering. It is parameterized with a resource ratio α∈ (0, 1), indicating that our available resources can only access an α-fraction of big dataset D. Given α, D and a query Q over D, it identifies DQD, and computes Q(DQ) and an accuracy bound η∈ (0, 1] such that

(1) |DQ| ≤ α|D|, where |DQ| is measured in its number of tuples; and

(2) accuracy(Q(DQ), Q, D) ≥ η.

Intuitively, it computes approximate answers Q(DQ) by accessing at most α|D| tuples in the entire process. Thus it can scale with D when D grows big by setting α small. Moreover, Q(DQ) assure a deterministic accuracy bound η:

(a) for each approximate answer sQ(DQ), there exists an exact answer tQ(D) that is η-close to s, i.e., s is within distance η of t; and

(b) for each exact answer tQ(D), there exists an approximate answer sQ(DQ) that is η-close to t.

That is, Q(DQ) includes only “relevant” answers, and “covers” all exact answers. It finds sensible answers in users’ interest, and suffices for exploratory queries, e.g., real-time problem diagnosis on logs[15].

The objective is ambitious. As observed in[16], approximate query answering is challenging. Previous approaches often adopt an one-size-fit-all synopsis Dc and computes Q(Dc) for all queries Q posed on D. The approaches “substantially limit the types of queries they can execute”[15], and often focus on aggregate queries (max, min, avg, sum, count). Moreover, they make various assumptions on future queries, i.e., workloads, query predicates or QCSs, i.e., “the frequency of columns used for grouping and filtering does not change over time”. Worse still, they provide either no accuracy guarantee at all, or probabilistic error rates for aggregate queries only. Such error rates do not tell us how “good” each approximate answer is.

Nonetheless, data-driven approximation is feasible under access schema.

Example 2: Continuing with Example 1, consider query Q2 to find me hotels that cost at most $95 per night and are in a city where one of my friends lives: select h.address, h.price from poi as h, friend as f, person as p where f.pid = p0 and f.fid = p.pid and p.city = h.city and h.type = “hotel” and h.price ≤ 95 We can compute Q2(D0) in a big dataset D0 of trillions of tuples given a small α, e.g., 10-4, i.e., when our available resources can afford to access at most 10-4 * |D0| tuples. This is doable by using an access schema A0, which includes φ1 and φ2 of Example 1 and in addition, the following extended access constraints: ψ1: poi({type, city} → {price, address}, 1,(e1 p, e1 a)), . . . ψm: poi({type, city} → {price, address}, 2m,(em p, em a)), where m =$\left\lceil {{\log }_{2}}M \right\rceil $. Here M is the maximum number of distinct poi tuples in D0 grouped by (type, city). We build an index for each ψi in A0 such that for i∈[1, m], given any (type, city)-value (ct, cc), we can retrieve a set T of at most 2i(price, address) values from D0 by using the index for ψi; moreover, or each poi tuple (c°a, ct, cc, c° p) in D0, there exists (cp, ca)∈T such that the (price, address)-value (c° p, c° a) differs from (cp, ca) by distance at most (ei p, ei a). That is, T represents (price, address) values that correspond to (ct, cc) with at most 2i tuples, subject to distances(ei p, ei a). Intuitively, the indices give a hierarchical representation of relation poi with different resolutions i∈[1, m]. The higher the resolution i is, the smaller the distance (ei p, ei a) is, and the more accurate the index for ψi represents D0. Assume α|D0| > 10000, as in Facebook dataset D0. Then under A0, we can find hotels by accessing at most α|D0| tuples as follows: (a) fetch a set T1 of fid’s with p0 by accessing at most 5000 friend tuples using φ1; (b) for each fid in T1, fetch 1 associated city with φ2, yielding a set T2 of at most 5000 city values; (c) for each city c in T2, fetch at most 2ka (price, address) pairs corresponding to (“hotel”, c) by using ψ, where kα =$\left\lfloor {{\log }_{2}}(\alpha |{{D}_{0}}|-10000) \right\rfloor $; and (d) return a set S of those (price, address) values with price at most (95 +$e_{p}^{{{k}_{a}}}$), as approximate answers to Q2 in D0. The process accesses at most 5000 + 5000 +${{2}^{{{k}_{a}}}}$≤ α|D0| tuples in total. The set S of answers is accurate: (1) for each hotel h0(cp, ca) in the exact answers Q(D0), there exists (c° p, c° a) in S that are within$e_{p}^{{{k}_{a}}}$and$e_{a}^{{{k}_{a}}}$of cp and ca, respectively; and (2) for each hotel h'(c°p, c°a) in S, its price c°p exceeds 95 by at most$e_{p}^{{{k}_{a}}}$, e.g.,$e_{a}^{{{k}_{a}}}$=4 and c°p=99, and c°a is the address of hotel h°. Moreover, the larger α is, the smaller$e_{p}^{{{k}_{a}}}$and$e_{a}^{{{k}_{a}}}$are, and the more accurate S is. It has been shown[9] that for any dataset D, there exists such an access schema A such that D conforms to A, and for any resource ratio α∈(0, 1] and SQL queries Q over D, aggregate or not, there exists a dataset DQ$\subseteq \$D identified by reasoning about A and a deterministic accuracy bound η such that |DQ| ≤ α|D| and accuracy (Q(DQ), Q, D) ≥ η. Moreover, the larger α is, the higher η is.

As opposed to previous approximate query answering approaches, the data-driven approximation scheme is able to answer SQL queries Q that are (1) unpredictable, i.e., without assuming any prior knowledge about Q, (2) and generic, aggregate or not, with (3) deterministic accuracy η in terms of both relevance and coverage.

We find that the approximation scheme computes approximate answers to SQL queries, aggregate or not, with accuracy η ≥ 0.82 for SQL queries, even when α is as small as 5.5 × 10-4 [9]. That is, it reduces D of PB size to DQ of 550GB.

4 A Resource Bounded Query Evaluation Framework

We are now ready to present BEAS (Boundedly EvAluable Sql), a resource-bounded framework for querying big relations. For a big dataset D in an application, BEAS takes a resource ratio α∈(0, 1] as a parameter, and discovers an access schema A. Given an SQL query Q posed on D, BEAS works as follows:

(1) it checks whether Q is boundedly evaluable under A, i.e., exact answers Q(D) can be computed by accessing DQD such that |DQ| is independent of |D|;

(2) if so, it computes Q(D) by accessing a bounded fraction DQ of D;

(3) otherwise, BEAS identifies DQ with |DQ| ≤ α|D|, and computes Q(DQ) with a deterministic accuracy bound η, based on data-driven approximation.

That is, under the resource constraint α, BEAS computes exact answers Q(D) when possible, and approximate answers Q(DQ) otherwise with accuracy η.

As opposed to conventional DBMS, BEAS is unique in its ability to (1) comply with resource ratio α, i.e., it can scale with arbitrarily large datasets D by adjusting α based on available resources, (2) decide whether Q is boundedly evaluable, (3) answer unpredictable and generic SQL queries, aggregate or not, with deterministic accuracy η, and (4) be plugged into commercial DBMS and provide the DBMS with an immediate capacity to query big relations under constrained resources.

In light of these, BEAS is promising for providing small companies with the capability of big data analytics and hence, to benefit from big data services. It can also help big companies such as Huawei to reduce the cost and improve efficiency[11]. Moreover, parallel processing is not a silver bullet for big data analytics. Indeed, one might expect a parallel algorithm to have the parallel scalability, i.e., the algorithm would run faster given more processors. However, few algorithms in the literature have this performance guarantee. Worse yet, some computation problems are not parallel scalable, i.e., there exist no algorithms for them such that their running time can be substantially reduced by adding processors, no matter how many processors are used[17-18]. For such computation problems, bounded evaluation and data-driven approximation offer a feasible solution.

The idea of resource-bounded query answering is not limited to relations. It has been shown that bounded evaluation improves the performance of graph pattern matching via subgraph isomorphism, an intractable problem[1] that is widely used in social media marketing[19-20] and knowledge base expansion[21], by 4 orders of magnitude on average[22]. For personalized social search via subgraph isomorphism, data-driven approximation retains 100% accuracy (i.e., η = 1) when α is as small as 1.5 * 10-6 [23], i.e., when processing graphs G of 1PB, they access only 15GB of data, i.e., reducing G from PB to GB while retaining high accuracy!

Author Information and Contributions

Yang Cao is a post-doctoral research fellow in LFCS, the School of Informatics, University of Edinburgh, UK. He received his Ph.D from the University of Edinburgh in 2016, supervised by Prof. Wenfei Fan. He received his B.S. from Beihang University. He is the recipient of Best Paper Award for SIGMOD 2017, a Microsoft Asia Research Fellowship, and earlier awards including National No.1 of CUMCM and first prize of MCM. His current research interests include querying big data and graph pattern matching.

Yang Cao (yang.cao@ed.ac.uk, corresponding author) designed the research framework and carried out the system.

Professor Wenfei Fan is the Chair of Web Data Management in the School of Informatics, University of Edinburgh, UK, the director of Huawei-Edinburgh Joint Research Lab, the chief scientist of Beijing Advanced Innovation Center for Big Data and Brain Computing, and the director of the International Research Center on Big Data, Beihang University, China. He is a Fellow of the Royal Society of Edinburgh, UK, a Fellow of the ACM, a National Professor of the 1000-Talent Program and a Yangtze River Scholar, China. He received his Ph.D. from the University of Pennsylvania, and his MS and BS from Peking University. He is a recipient of the Alberto O. Mendelzon Test-of-Time Award of ACM PODS 2015 and 2010, ERC Advanced Fellowship in 2015, the Roger Needham Award in 2008 (UK), the Outstanding Overseas Young Scholar Award in 2003 (China), the Career Award in 2001 (USA), and several Best Paper Awards (SIGMOD 2017, VLDB 2010, ICDE 2007, and Computer Networks in 2002). His current research interests include database theory and systems, in particular big data, data quality, data integration, distributed query processing, query languages, recommender systems, social networks and Web services.

Wenfei Fan (wenfei@inf.ed.ac.uk) proposed the research problems, designed the research framework and wrote the manuscript.

Tengfei Yuan is a first year Ph.D. student in LFCS, the School of Informatics, University of Edinburgh, supervised by Prof. Wenfei Fan. He received his B.Eng. from Shandong University. He is currently working on the development of BEAS, a system for bounded evaluation of SQL queries.

Tengfei Yuan (tengfei.yuan@ed.ac.uk) implemented system and carried out the tests.

PDF下载数
RichHTML 浏览数

 相关文章:

Yang Cao
Wenfei Fan
Tengfei Yuan
 版权所有 © 2015 《数据分析与知识发现》编辑部 地址：北京市海淀区中关村北四环西路33号 邮编：100190 电话/传真：(010)82626611-6626，82624938 E-mail:jishu@mail.las.ac.cn