数据分析与知识发现 2017 , 1 (9): 1-7 https://doi.org/10.11925/infotech.2096-3467.2017.0723

专家视点

Is Big Data Analytics Beyond the Reach of Small Companies?

Yang Cao¹, Wenfei Fan¹^,², Tengfei Yuan¹

¹ University of Edinburgh, Edinburgh, UK
² Beihang University, Beijing, China

收稿日期: 2017-07-21

接受日期: 2017-07-28

网络出版日期: 2017-09-25

展开

Abstract

Big data analytics is often prohibitively costly. It is typically conducted by parallel processing with a cluster of machines, and is considered a privilege of big companies that can afford the resources. This position paper argues that big data analytics is accessible to small companies with constrained resources. As an evidence, we present BEAS, a framework for querying big relations with constrained resources, based on bounded evaluation and data-driven approximation.

Keywords： Big data analytics ; Bounded evaluation ; Data-driven approximation ; Constrained resources

PDF (0KB) 元数据多维度评价相关文章收藏文章

本文引用格式导出 EndNote Ris Bibtex

Yang Cao, Wenfei Fan, Tengfei Yuan. Is Big Data Analytics Beyond the Reach of Small Companies?[J]. 数据分析与知识发现, 2017, 1(9): 1-7 https://doi.org/10.11925/infotech.2096-3467.2017.0723

1 Introduction

Big data analytics is expensive. Textbooks tell us that a computation problem is tractable if there exists a polynomial-time algorithm for it, i.e., its cost can be expressed as a polynomial in the size n of the input^[1]. However, it is no longer the case when it comes to big data. When n is big, algorithms in O(n²) or even O(n) time may take too long to be practical. Indeed, assuming the largest Solid State Drives (SSD) with 12GB/s for read^[2], a linear scan of a dataset of 15TB takes more than 20 minutes. It easily takes hours to join tables with millions of tuples^[3]. In other words, many computation problems that are typically considered tractable may become infeasible in the context of big data.

One might be tempted to think that parallel computation could solve the problem, by adding more processors when needed. However, small businesses often have constrained resources and cannot afford large-scale parallel computation.

Is big data analytics a privilege of big companies? Is it beyond the reach of small companies that can only afford constrained resources?

We argue that big data analytics is possible under constrained resources. As an example, we consider relational query answering. Given an SQL query Q and a relational database D, it is to compute the answers Q(D) to Q in D. Relational data accounts for the majority of data in industry. Moreover, it is nontrivial to compute Q(D). Indeed, it is NP-complete to decide whether a given tuple t is in Q(D) when Q is a simple SPC query (selection, projection and Cartesian product), and is PSPACE-complete when Q is in relational algebra, both subsumed by SQL.

We propose BEAS, a new query evaluation paradigm to answer SQL queries under constrained resources. The idea is to make big data small, i.e., to reduce queries on big data to computation on small data. Underlying BEAS are two principled approaches: (1) bounded evaluation that computes exact answers by accessing a bounded amount of data when possible^[4-8], and (2) a data-driven approximation scheme that answers queries for which exact answers are beyond reach under bounded resources, and offers a deterministic accuracy bound^[9].

As proof of concept, we have developed a prototype system^[10] and evaluated it with call-detailed-record (CDR) queries at Huawei. We find that bounded evaluation improves the performance of CDR queries by orders of magnitude^[11], and is able to reduce big datasets from PB (10¹⁵) to GB (10⁹) in many cases.

Below we give a brief introduction to bounded evaluation (Section 2), followed by data-driven approximation scheme (Section 3). Putting these together, we present BEAS, our resource-bounded query evaluation framework (Section 4).

2 Bounded Evaluation

Given a query Q posed on a big dataset D, bounded evaluation^[7] aims to compute Q(D) by accessing only a bounded subset D_Q of D that includes necessary information for answering Q in D, instead of the entire D. To identify D_Q, it makes use of an access schema A, which is a set of access constraints, i.e., a combination of simple cardinality constraints and their associated indices.

Under access schema A, query Q is boundedly evaluable if for all datasets D that conform to A, there exists a fraction D_Q⊆ D such that

◦Q(D_Q) = Q(D), i.e., D_Q suffices for computing exact answers Q(D); and

◦the time for identifying D_Q and hence the size |D_Q| of D_Q are determined by Q and A only. That is, the cost of computing Q(D_Q) is independent of |D|.

Intuitively, Q(D) can be computed by accessing D_Q. We identify D_Q by reasoning about the cardinality constraints in A, and fetch it by using the indices in A.

Example 1: Consider a database schema R₀ consisting of three relations:

(a) person(pid, city), stating that pid lives in city,

(b) friend(pid, fid), saying that fid is a friend of pid, and

An example access schema A consists of the following two access constraints:

◦φ₁: friend(pid → fid, 5000),

◦φ₂: person(pid → city, 1).

Here φ₁ is a constraint imposed by Facebook^[12]: a limit of 5000 friends per person; and φ₂ states that each person lives in at most one city. An index is built for φ₁ such that given a pid, it returns all fids of pid from friend, i.e., φ₁ includes the cardinality constraint (5000 fids for each pid) and the index; similarly for φ₂.

Consider a query Q₁ to find the cities where my friends live, which is taken from Graph Search of Facebook^[13]. Written in SQL, Q₁ can be expressed as:

select p.city

from friend as f, person as p

where f.pid = p₀and f.fid = p.pid,

where p₀ indicates “me”. When an instance D₀ of R₀ is “big”, e.g., Facebook has billions of users and trillions of friend links^[12], it is costly to compute Q₁(D₀).

However, we can do better since Q₁ is boundedly evaluable under A: (a) we first identify and fetch at most 5000 fids for person p₀ from relation friend by using the index for φ1, and (b) for each fid fetched, we get her city by fetching 1 tuple from relation person via the index for φ₂. In total we fetch a set D_Q of 10,000 tuples, instead of trillions; it suffices to compute Q₁(D₀) by using D_Q only^[4].

The notion of access schema was proposed in^[8], and the foundation of bounded evaluation was established in^[5,7]. The theory of bounded evaluation was first evaluated using SPC queries^[6] and then extended to relational algebra(RA)^[4]. A challenge is that it is undecidable to decide whether an SQL query is boundedly evaluable under an access schema A ^[7]. To cope with this, an effective syntax L was developed for boundedly evaluable RA queries^[4]. That is, L is a class of RA queries such that under A,

(a) an RA query Q is boundedly evaluable if and only if it is equivalent to a query Q' in L; and

(b) it takes PTIME (polynomial time) in the size |Q| of Q and size |A| of A to check whether Q' is in L, reducing the problem to syntactic checking.

That is, L identifies the core subclass of boundedly evaluable RA queries, without sacrificing their expressive power. This is analogous to the study of safe relational calculus queries, which is also undecidable^[14]. Based on the effective syntax, we can efficiently check whether a query Q is boundedly evaluable, and if so, generate a query plan to answer Q by accessing a bounded fraction D_Q of D^[4].

It has been shown that bounded evaluation can be readily built on top of commercial DBMS (database management systems) such as MySQL and PostgreSQL. It extends the DBMS with an immediate capability of querying big relations under constrained resources^[4]. A prototype system was developed in^[10]. We find that about 77% of SPC queries^[6] and 67% of SQL queries^[4] are boundedly evaluable. Better yet, more than 90% of the CDR queries at Huawei are boundedly evaluable.

3 Data Driven Approximation

For queries Q that are not boundedly evaluable, can we evaluate Q against a big dataset D under constrained resources? We answer the question in the affirmative.

We propose a data-driven scheme for approximate query answering. It is parameterized with a resource ratio α∈ (0, 1), indicating that our available resources can only access an α-fraction of big dataset D. Given α, D and a query Q over D, it identifies D_Q⊆D, and computes Q(D_Q) and an accuracy bound η∈ (0, 1] such that

(1) |D_Q| ≤ α|D|, where |D_Q| is measured in its number of tuples; and

(2) accuracy(Q(D_Q), Q, D) ≥ η.

Intuitively, it computes approximate answers Q(D_Q) by accessing at most α|D| tuples in the entire process. Thus it can scale with D when D grows big by setting α small. Moreover, Q(D_Q) assure a deterministic accuracy bound η:

(a) for each approximate answer s∈Q(D_Q), there exists an exact answer t∈Q(D) that is η-close to s, i.e., s is within distance η of t; and

(b) for each exact answer t∈Q(D), there exists an approximate answer s∈Q(D_Q) that is η-close to t.

That is, Q(D_Q) includes only “relevant” answers, and “covers” all exact answers. It finds sensible answers in users’ interest, and suffices for exploratory queries, e.g., real-time problem diagnosis on logs^[15].

The objective is ambitious. As observed in^[16], approximate query answering is challenging. Previous approaches often adopt an one-size-fit-all synopsis D_c and computes Q(D_c) for all queries Q posed on D. The approaches “substantially limit the types of queries they can execute”^[15], and often focus on aggregate queries (max, min, avg, sum, count). Moreover, they make various assumptions on future queries, i.e., workloads, query predicates or QCSs, i.e., “the frequency of columns used for grouping and filtering does not change over time”. Worse still, they provide either no accuracy guarantee at all, or probabilistic error rates for aggregate queries only. Such error rates do not tell us how “good” each approximate answer is.

Nonetheless, data-driven approximation is feasible under access schema.

Example 2: Continuing with Example 1, consider query Q₂ to find me hotels that cost at most $95 per night and are in a city where one of my friends lives:

select h.address, h.price

from poi as h, friend as f, person as p

where f.pid = p₀ and f.fid = p.pid and p.city = h.city and h.type = “hotel” and h.price ≤ 95

We can compute Q₂(D₀) in a big dataset D₀ of trillions of tuples given a small α, e.g., 10^-4, i.e., when our available resources can afford to access at most 10^-4 * |D₀| tuples. This is doable by using an access schema A₀, which includes φ₁ and φ₂ of Example 1 and in addition, the following extended access constraints:

◦ψ₁: poi({type, city} → {price, address}, 1,(e1 p, e1 a)),

. . .

◦ψ_m: poi({type, city} → {price, address}, 2^m,(em p, em a)), where m =$\left\lceil {{\log }_{2}}M \right\rceil $.

Here M is the maximum number of distinct poi tuples in D₀ grouped by (type, city). We build an index for each ψ_i in A₀ such that for i∈[1, m], given any (type, city)-value (c_t, c_c), we can retrieve a set T of at most 2ⁱ(price, address) values from D₀ by using the index for ψ_i; moreover, or each poi tuple (c°a, c_t, c_c, c° p) in D₀, there exists (c_p, c_a)∈T such that the (price, address)-value (c° p, c° a) differs from (c_p, c_a) by distance at most (ei p, ei a). That is, T represents (price, address) values that correspond to (c_t, c_c) with at most 2ⁱ tuples, subject to distances(ei p, ei a). Intuitively, the indices give a hierarchical representation of relation poi with different resolutions i∈[1, m]. The higher the resolution i is, the smaller the distance (ei p, ei a) is, and the more accurate the index for ψ_i represents D₀.

Assume α|D₀| > 10000, as in Facebook dataset D₀. Then under A₀, we can find hotels by accessing at most α|D₀| tuples as follows: (a) fetch a set T₁ of fid’s with p₀ by accessing at most 5000 friend tuples using φ₁; (b) for each fid in T₁, fetch 1 associated city with φ₂, yielding a set T₂ of at most 5000 city values; (c) for each city c in T₂, fetch at most 2^ka

(price, address) pairs corresponding to (“hotel”, c) by using ψ_kα, where k_α =$\left\lfloor {{\log }_{2}}(\alpha |{{D}_{0}}|-10000) \right\rfloor $; and (d) return a set S of those (price, address) values with price at most (95 +$e_{p}^{{{k}_{a}}}$), as approximate answers to Q₂ in D₀. The process accesses at most 5000 + 5000 + ${{2}^{{{k}_{a}}}}$ ≤ α|D₀| tuples in total.

The set S of answers is accurate: (1) for each hotel h₀(c_p, c_a) in the exact answers Q(D₀), there exists (c° p, c° a) in S that are within $e_{p}^{{{k}_{a}}}$ and $e_{a}^{{{k}_{a}}}$ of c_p and c_a, respectively; and (2) for each hotel h'(c°p, c°a) in S, its price c°p exceeds 95 by at most $e_{p}^{{{k}_{a}}}$, e.g., $e_{a}^{{{k}_{a}}}$=4 and c°p=99, and c°a is the address of hotel h°. Moreover, the larger α is, the smaller $e_{p}^{{{k}_{a}}}$ and $e_{a}^{{{k}_{a}}}$ are, and the more accurate S is.

It has been shown^[9] that for any dataset D, there exists such an access schema A such that D conforms to A, and for any resource ratio α∈(0, 1] and SQL queries Q over D, aggregate or not, there exists a dataset D_{Q$\subseteq $}D identified by reasoning about A and a deterministic accuracy bound η such that |D_Q| ≤ α|D| and accuracy (Q(D_Q), Q, D) ≥ η. Moreover, the larger α is, the higher η is.

As opposed to previous approximate query answering approaches, the data-driven approximation scheme is able to answer SQL queries Q that are (1) unpredictable, i.e., without assuming any prior knowledge about Q, (2) and generic, aggregate or not, with (3) deterministic accuracy η in terms of both relevance and coverage.

We find that the approximation scheme computes approximate answers to SQL queries, aggregate or not, with accuracy η ≥ 0.82 for SQL queries, even when α is as small as 5.5 × 10^{-4 [9]}. That is, it reduces D of PB size to D_Q of 550GB.

4 A Resource Bounded Query Evaluation Framework

We are now ready to present BEAS (Boundedly EvAluable Sql), a resource-bounded framework for querying big relations. For a big dataset D in an application, BEAS takes a resource ratio α∈(0, 1] as a parameter, and discovers an access schema A. Given an SQL query Q posed on D, BEAS works as follows:

(1) it checks whether Q is boundedly evaluable under A, i.e., exact answers Q(D) can be computed by accessing D_Q⊆D such that |D_Q| is independent of |D|;

(2) if so, it computes Q(D) by accessing a bounded fraction D_Q of D;

(3) otherwise, BEAS identifies D_Q with |D_Q| ≤ α|D|, and computes Q(D_Q) with a deterministic accuracy bound η, based on data-driven approximation.

That is, under the resource constraint α, BEAS computes exact answers Q(D) when possible, and approximate answers Q(D_Q) otherwise with accuracy η.

As opposed to conventional DBMS, BEAS is unique in its ability to (1) comply with resource ratio α, i.e., it can scale with arbitrarily large datasets D by adjusting α based on available resources, (2) decide whether Q is boundedly evaluable, (3) answer unpredictable and generic SQL queries, aggregate or not, with deterministic accuracy η, and (4) be plugged into commercial DBMS and provide the DBMS with an immediate capacity to query big relations under constrained resources.

In light of these, BEAS is promising for providing small companies with the capability of big data analytics and hence, to benefit from big data services. It can also help big companies such as Huawei to reduce the cost and improve efficiency^[11]. Moreover, parallel processing is not a silver bullet for big data analytics. Indeed, one might expect a parallel algorithm to have the parallel scalability, i.e., the algorithm would run faster given more processors. However, few algorithms in the literature have this performance guarantee. Worse yet, some computation problems are not parallel scalable, i.e., there exist no algorithms for them such that their running time can be substantially reduced by adding processors, no matter how many processors are used^[17-18]. For such computation problems, bounded evaluation and data-driven approximation offer a feasible solution.

The idea of resource-bounded query answering is not limited to relations. It has been shown that bounded evaluation improves the performance of graph pattern matching via subgraph isomorphism, an intractable problem^[1] that is widely used in social media marketing^[19-20] and knowledge base expansion^[21], by 4 orders of magnitude on average^[22]. For personalized social search via subgraph isomorphism, data-driven approximation retains 100% accuracy (i.e., η = 1) when α is as small as 1.5 * 10^{-6 [23]}, i.e., when processing graphs G of 1PB, they access only 15GB of data, i.e., reducing G from PB to GB while retaining high accuracy!

Author Information and Contributions

Yang Cao is a post-doctoral research fellow in LFCS, the School of Informatics, University of Edinburgh, UK. He received his Ph.D from the University of Edinburgh in 2016, supervised by Prof. Wenfei Fan. He received his B.S. from Beihang University. He is the recipient of Best Paper Award for SIGMOD 2017, a Microsoft Asia Research Fellowship, and earlier awards including National No.1 of CUMCM and first prize of MCM. His current research interests include querying big data and graph pattern matching.

Yang Cao (yang.cao@ed.ac.uk, corresponding author) designed the research framework and carried out the system.

Professor Wenfei Fan is the Chair of Web Data Management in the School of Informatics, University of Edinburgh, UK, the director of Huawei-Edinburgh Joint Research Lab, the chief scientist of Beijing Advanced Innovation Center for Big Data and Brain Computing, and the director of the International Research Center on Big Data, Beihang University, China. He is a Fellow of the Royal Society of Edinburgh, UK, a Fellow of the ACM, a National Professor of the 1000-Talent Program and a Yangtze River Scholar, China. He received his Ph.D. from the University of Pennsylvania, and his MS and BS from Peking University. He is a recipient of the Alberto O. Mendelzon Test-of-Time Award of ACM PODS 2015 and 2010, ERC Advanced Fellowship in 2015, the Roger Needham Award in 2008 (UK), the Outstanding Overseas Young Scholar Award in 2003 (China), the Career Award in 2001 (USA), and several Best Paper Awards (SIGMOD 2017, VLDB 2010, ICDE 2007, and Computer Networks in 2002). His current research interests include database theory and systems, in particular big data, data quality, data integration, distributed query processing, query languages, recommender systems, social networks and Web services.

Wenfei Fan (wenfei@inf.ed.ac.uk) proposed the research problems, designed the research framework and wrote the manuscript.

Tengfei Yuan is a first year Ph.D. student in LFCS, the School of Informatics, University of Edinburgh, supervised by Prof. Wenfei Fan. He received his B.Eng. from Shandong University. He is currently working on the development of BEAS, a system for bounded evaluation of SQL queries.

Tengfei Yuan (tengfei.yuan@ed.ac.uk) implemented system and carried out the tests.

参考文献

原文顺序

文献年度倒序

文中引用次数倒序

被引期刊影响因子

[1]	Papadimitriou, C.H.: Computational Complexity. Addison-Wesley (1994). [本文引用: 2]
[2]	ITPro Newsletter: Samsung reveals’ largest ever’ SSD at 15TB (2016). URL [本文引用: 1]
[3]	Stack Overflow: SQL: Inner joining two massive tables. URL [本文引用: 1]
[4]	Cao, Y., Fan, W.: An effective syntax for bounded relational queries . In: SIGMOD (2016). . URL [本文引用: 7]
[5]	Cao Y., Fan W., Geerts F., Lu P.: Bounded query rewriting using views . In: PODS(2016). DOI: 10.1145/2902251.2902294. URL [本文引用: 1] 摘要 A query Q has a bounded rewriting using a set of views if there exists a query Q' expressed in the same language as Q, such that given a dataset D, Q(D) can be computed by Q' that accesses only cached views and a small fraction DQ of D. We consider datasets D that satisfy a set of access constraints, a combination of cardinality constraints and associated indices, such that the size \|DQ\| of DQ and the time to identify DQ are independent of \|D\|, no matter how big D is. This paper studies the problem for deciding whether a query has a bounded rewriting given a set V of views and a set A of access constraints. We establish the complexity of the problem for various query languages, from Σ3p-complete for conjunctive queries (CQ), to undecidable for relational algebra (FO). We show that the intractability for CQ is rather robust even for acyclic CQ with fixed V and A, and characterize when the problem is in PTIME. To make practical use of bounded rewriting, we provide an effective syntax for FO queries that have a bounded rewriting. The syntax characterizes a core subclass of such queries without sacrificing the expressive power, and can be checked in PTIME.
[6]	Cao Y., Fan W., Wo T., Yu, W.: Bounded conjunctive queries. PVLDB (. URL [本文引用: 2]
[7]	Fan W., Geerts F., Cao Y., Deng T.: Querying big data by accessing small data . In: PODS (2015). DOI:10.1145/2745754.2745771. URL [本文引用: 3] 摘要 This paper investigates the feasibility of querying big data by accessing a bounded amount of the data. We study boundedly evaluable queries under a form of access constraints, when their evaluation cost is determined by the queries and constraints only. While it is undecidable to determine whether FO queries are boundedly evaluable, we show that for several classes of FO queries, the bounded evaluability problem is decidable. We also provide characterization and effective syntax for their boundedly evaluable queries. When a query Q is not boundedly evaluable, we study two approaches to approximately answering Q under access constraints. (1) We search for upper and lower envelopes of Q that are boundedly evaluable and warrant a constant accuracy bound. (2) We instantiate a minimum set of variables (parameters) in Q such that the specialized query is boundedly evaluable. We study problems for deciding the existence of envelopes and bounded specialized queries, and establish their complexity for various classes of FO queries.
[8]	Fan W., Geerts F., Libkin L.: On scale independence for querying big data . In: PODS(2014). DOI:10.1145/2594538.2594551. URL [本文引用: 2] 摘要 To make query answering feasible in big datasets, practitioners have been looking into the notion of scale independence of queries. Intuitively, such queries require only a relatively small subset of the data, whose size is determined by the query and access methods rather than the size of the dataset itself. This paper aims to formalize this notion and study its properties. We start by defining what it means to be scale-independent, and provide matching upper and lower bounds for checking scale independence, for queries in various languages, and for combined and data complexity. Since the complexity turns out to be rather high, and since scale-independent queries cannot be captured syntactically, we develop sufficient conditions for scale independence. We formulate them based on access schemas, which combine indexing and constraints together with bounds on the sizes of retrieved data sets. We then study two variations of scale-independent query answering, inspired by existing practical systems. One concerns incremental query answering: we check when query answers can be maintained in response to updates scale-independently. The other explores scale-independent query rewriting using views.
[9]	Cao, Y., Fan, W.: Data driven approximation with bounded resources . PVLDB 10(9), 973-984 (. URL [本文引用: 3]
[10]	Cao Y., Fan W., Wang Y., Yuan T., Li Y., Chen L.Y.: BEAS: Bounded evaluation of SQL queries . In: SIGMOD, pp. 1667-1670 (2017). . URL [本文引用: 2]
[11]	University of Edinburgh: Huawei deal to advance expertise in data science (2017). URL [本文引用: 2]
[12]	Grujic I.,Bogdanovic-Dinic, S., Stoimenov, L.: Collecting and analyzing data from e-government Facebook pages . In: ICT Innovations (2014). URL [本文引用: 2] 摘要 Abstract Facebook has introduced a new ways for people to communicate and, at the same time, to share stories and experiences. Therefore, its usage in the area of e-Government becomes every day more and more important. In this paper we will give an overview of an application for collecting data from e-Government Facebook pages of Republic of Ser-bia which we have analyzed. Found results suggest in which way using Facebook we can improve interaction between citizens and e-Government which leads to maximal potential and usage of these social network.
[13]	Facebook: Introducing Graph Search. (2013). URL [本文引用: 1]
[14]	Abiteboul S., Hull R., Vianu, V.: Foundations of Databases. Addison-Wesley (). URL [本文引用: 1]
[15]	Agarwal S., Mozafari B., Panda A., Milner H., Madden S., Stoica I.: BlinkDB: Queries with bounded errors and bounded response times on very large data . In: EuroSys. 2013. . URL [本文引用: 2]
[16]	Chaudhuri S., Ding B., Kandula S.: Approximate query processing: No silver bullet . In: SIGMOD, pp. 511-519 (2017). DOI: 10.1145/3035918.3056097. URL [本文引用: 1] 摘要 Abstract In this paper, we reflect on the state of the art of Approximate Query Processing. Although much technical progress has been made in this area of research, we are yet to see its impact on products and services. We discuss two promising avenues to pursue towards integrating Approximate Query Processing into data platforms.
[17]	Fan W., Wang X., Wu, Y.: Distributed graph simulation: Impossibility and possibility . PVLDB 7(12) (2014). DOI: 10.14778/2732977.2732983. URL [本文引用: 1] 摘要 This paper studies fundamental problems for distributed graph simulation. Given a pattern query Q and a graph G that is fragmented and distributed, a graph simulation algorithm A is to compute the matches Q(G) of Q in G. We say that A is parallel scalable in (a) response time if its parallel computational cost is determined by the largest fragment Fm of G and the size \|Q\| of query Q, and (b) data shipment if its total amount of data shipped is determined by \|Q\| and the number of fragments of G, independent of the size of graph G. (1) We prove an impossibility theorem: there exists no distributed graph simulation algorithm that is parallel scalable in either response time or data shipment. (2)However, we show that distributed graph simulation is partition bounded, i.e., its response time depends only on \|Q\|,\|Fm\| and the number \|Vf \| of nodes in G with edges across different fragments; and its data shipment depends on \|Q\| and the number \|Ef \| of crossing edges only. We provide the first algorithms with these performance guarantees. (3) We also identify special cases of patterns and graphs when parallel scalability is possible. (4) We experimentally verify the scalability and efficiency of our algorithms.
[18]	Xie C., Chen R., Guan H., Zang B., Chen H.: SYNC or ASYNC: Time to fuse for distributed graph-parallel computation . In: PPoPP, pp. 194-204 (2015). [本文引用: 1]
[19]	Fan W., Wang X., Wu Y., Xu, J.: Association rules with graph patterns . PVLDB 8(12),1502-1513 (2015). DOI:10.14778/2824032.2824048. URL [本文引用: 1] 摘要 Abstract We propose graph-pattern association rules (GPARs) for so- cial media marketing. Extending association rules for item- sets, GPARs help us discover regularities between entities in social graphs, and identify potential customers by exploring social influence. We study the problem of discovering top- k diversified GPARs. While this problem is NP-hard, we develop a parallel algorithm with accuracy bound. We also study the problem of identifying potential customers with GPARs. While it is also NP-hard, we provide a parallel scal- able algorithm that guarantees a polynomial speedup over sequential algorithms with the increase of processors. Using real-life and synthetic graphs, we experimentally verify the scalability and effectiveness of the algorithms.
[20]	Fan W., Wu Y., Xu J.: Adding counting quantifiers to graph patterns . In: SIGMOD, pp. 1215-1230 (2016). DOI:10.1145/2882903.2882937. URL [本文引用: 1] 摘要 This paper proposes quantified graph patterns (QGPs), an extension of graph patterns by supporting simple counting quantifiers on edges. We show that QGPs naturally express universal and existential quantification, numeric and ratio aggregates, as well as negation. Better still, the increased expressivity does not come with a much higher price. We show that quantified matching, i.e., graph pattern matching with QGPs, remains NP-complete in the absence of negation, and is DP-complete for general QGPs. We show how quantified matching can be conducted by incorporating quantifier checking into conventional subgraph isomorphism methods. We also develop parallel scalable algorithms for quantified matching. As an application of QGPs, we introduce quantified graph association rules defined with QGPs, to identify potential customers in social media marketing. Using real-life and synthetic graphs, we experimentally verify the effectiveness of QGPs and the scalability of our algorithms.
[21]	Fan W., Wu Y., Xu J.: Functional dependencies for graphs . In: SIGMOD (2016). DOI: 10.1145/2882903.2915232. [本文引用: 1]
[22]	Cao Y., Fan W., Huang R.: Making pattern queries bounded in big graphs . In: ICDE(2015). DOI: 10.1109/ICDE.2015.7113281. URL [本文引用: 1] 摘要 It is cost-prohibitive to find matches Q(G) of a pattern query Q in a big graph G. We approach this by fetching a small subgraph GQ of G such that Q(GQ) = Q(G). We show that many practical patterns are effectively bounded under access constraints A commonly found in real life, such that GQ can be identified in time determined by Q and A only, independent of the size \|G\| of G. This holds no matter whether pattern queries are localized (e.g., via subgraph isomorphism) or non-localized (graph simulation). We provide algorithms to decide whether a pattern Q is effectively bounded, and if so, to generate a query plan that computes Q(G) by accessing GQ, in time independent of \|G\|. When Q is not effectively bounded, we give an algorithm to extend access constraints and make Q bounded in G. Using real-life data, we experimentally verify the effectiveness of the approach, e.g., about 60% of queries are effectively bounded for subgraph isomorphism, and for such queries our approach outperforms the conventional methods by 4 orders of magnitude.
[23]	Fan W., Wang X., Wu Y.: Querying big graphs within bounded resources . In: SIGMOD(2014). DOI: 10.1145/2588555.2610513. URL [本文引用: 1] 摘要 This paper studies the problem of querying graphs within bounded resources. Given a query Q, a graph G and a small ratio α, it aims to answer Q in G by accessing only a fraction GQ of G of size \|GQ\| ≤ α \|G\|. The need for this is evident when G is big while our available resources are limited, as indicated by α. We propose resource-bounded query answering via a dynamic scheme that reduces big G to GQ. We investigate when we can find the exact answers Q(G) from GQ, and if GQ cannot accommodate enough information, how accurate the approximate answers Q(GQ) are. To verify the effectiveness of the approach, we study two types of queries. One consists of pattern queries that have data locality, such as subgraph isomorphism and strong simulation. The other is the class of reachability queries, without data locality. We show that it is hard to get resource-bounded algorithms with 100% accuracy: NP-hard for pattern queries, and non-existing for reachability when α ≠ 1. Despite these, we develop resource-bounded algorithms for answering these queries. Using real-life and synthetic data, we experimentally evaluate the performance of the algorithms. We find that they scale well for both types of queries, and our approximate answers are accurate, even 100% for small α.

1994

... Big data analytics is expensive. Textbooks tell us that a computation problem is tractable if there exists a polynomial-time algorithm for it, i.e., its cost can be expressed as a polynomial in the size n of the input^[1]. However, it is no longer the case when it comes to big data. When n is big, algorithms in O(n²) or even O(n) time may take too long to be practical. Indeed, assuming the largest Solid State Drives (SSD) with 12GB/s for read^[2], a linear scan of a dataset of 15TB takes more than 20 minutes. It easily takes hours to join tables with millions of tuples^[3]. In other words, many computation problems that are typically considered tractable may become infeasible in the context of big data. ...

... The idea of resource-bounded query answering is not limited to relations. It has been shown that bounded evaluation improves the performance of graph pattern matching via subgraph isomorphism, an intractable problem^[1] that is widely used in social media marketing^[19-20] and knowledge base expansion^[21], by 4 orders of magnitude on average^[22]. For personalized social search via subgraph isomorphism, data-driven approximation retains 100% accuracy (i.e., η = 1) when α is as small as 1.5 * 10^{-6 [23]}, i.e., when processing graphs G of 1PB, they access only 15GB of data, i.e., reducing G from PB to GB while retaining high accuracy! ...

2016

An effective syntax for bounded relational queries

2016

... We propose BEAS, a new query evaluation paradigm to answer SQL queries under constrained resources. The idea is to make big data small, i.e., to reduce queries on big data to computation on small data. Underlying BEAS are two principled approaches: (1) bounded evaluation that computes exact answers by accessing a bounded amount of data when possible^[4-8], and (2) a data-driven approximation scheme that answers queries for which exact answers are beyond reach under bounded resources, and offers a deterministic accuracy bound^[9]. ...

... However, we can do better since Q₁ is boundedly evaluable under A: (a) we first identify and fetch at most 5000 fids for person p₀ from relation friend by using the index for φ1, and (b) for each fid fetched, we get her city by fetching 1 tuple from relation person via the index for φ₂. In total we fetch a set D_Q of 10,000 tuples, instead of trillions; it suffices to compute Q₁(D₀) by using D_Q only^[4]. ...

... The notion of access schema was proposed in^[8], and the foundation of bounded evaluation was established in^[5,7]. The theory of bounded evaluation was first evaluated using SPC queries^[6] and then extended to relational algebra(RA)^[4]. A challenge is that it is undecidable to decide whether an SQL query is boundedly evaluable under an access schema A ^[7]. To cope with this, an effective syntax L was developed for boundedly evaluable RA queries^[4]. That is, L is a class of RA queries such that under A, ...

... [4]. That is, L is a class of RA queries such that under A, ...

... That is, L identifies the core subclass of boundedly evaluable RA queries, without sacrificing their expressive power. This is analogous to the study of safe relational calculus queries, which is also undecidable^[14]. Based on the effective syntax, we can efficiently check whether a query Q is boundedly evaluable, and if so, generate a query plan to answer Q by accessing a bounded fraction D_Q of D^[4]. ...

... It has been shown that bounded evaluation can be readily built on top of commercial DBMS (database management systems) such as MySQL and PostgreSQL. It extends the DBMS with an immediate capability of querying big relations under constrained resources^[4]. A prototype system was developed in^[10]. We find that about 77% of SPC queries^[6] and 67% of SQL queries^[4] are boundedly evaluable. Better yet, more than 90% of the CDR queries at Huawei are boundedly evaluable. ...

... [4] are boundedly evaluable. Better yet, more than 90% of the CDR queries at Huawei are boundedly evaluable. ...

Bounded query rewriting using views

2016

Querying big data by accessing small data

2015

... Given a query Q posed on a big dataset D, bounded evaluation^[7] aims to compute Q(D) by accessing only a bounded subset D_Q of D that includes necessary information for answering Q in D, instead of the entire D. To identify D_Q, it makes use of an access schema A, which is a set of access constraints, i.e., a combination of simple cardinality constraints and their associated indices. ...

... [7]. To cope with this, an effective syntax L was developed for boundedly evaluable RA queries^[4]. That is, L is a class of RA queries such that under A, ...

On scale independence for querying big data

2014

Data driven approximation with bounded resources

... It has been shown^[9] that for any dataset D, there exists such an access schema A such that D conforms to A, and for any resource ratio α∈(0, 1] and SQL queries Q over D, aggregate or not, there exists a dataset D_{Q$\subseteq $}D identified by reasoning about A and a deterministic accuracy bound η such that |D_Q| ≤ α|D| and accuracy (Q(D_Q), Q, D) ≥ η. Moreover, the larger α is, the higher η is. ...

... We find that the approximation scheme computes approximate answers to SQL queries, aggregate or not, with accuracy η ≥ 0.82 for SQL queries, even when α is as small as 5.5 × 10^{-4 [9]}. That is, it reduces D of PB size to D_Q of 550GB. ...

BEAS: Bounded evaluation of SQL queries

2017

... As proof of concept, we have developed a prototype system^[10] and evaluated it with call-detailed-record (CDR) queries at Huawei. We find that bounded evaluation improves the performance of CDR queries by orders of magnitude^[11], and is able to reduce big datasets from PB (10¹⁵) to GB (10⁹) in many cases. ...

2017

... In light of these, BEAS is promising for providing small companies with the capability of big data analytics and hence, to benefit from big data services. It can also help big companies such as Huawei to reduce the cost and improve efficiency^[11]. Moreover, parallel processing is not a silver bullet for big data analytics. Indeed, one might expect a parallel algorithm to have the parallel scalability, i.e., the algorithm would run faster given more processors. However, few algorithms in the literature have this performance guarantee. Worse yet, some computation problems are not parallel scalable, i.e., there exist no algorithms for them such that their running time can be substantially reduced by adding processors, no matter how many processors are used^[17-18]. For such computation problems, bounded evaluation and data-driven approximation offer a feasible solution. ...

Collecting and analyzing data from e-government Facebook pages

2014

... Here φ₁ is a constraint imposed by Facebook^[12]: a limit of 5000 friends per person; and φ₂ states that each person lives in at most one city. An index is built for φ₁ such that given a pid, it returns all fids of pid from friend, i.e., φ₁ includes the cardinality constraint (5000 fids for each pid) and the index; similarly for φ₂. ...

... where p₀ indicates “me”. When an instance D₀ of R₀ is “big”, e.g., Facebook has billions of users and trillions of friend links^[12], it is costly to compute Q₁(D₀). ...

2013

... Consider a query Q₁ to find the cities where my friends live, which is taken from Graph Search of Facebook^[13]. Written in SQL, Q₁ can be expressed as: ...

BlinkDB: Queries with bounded errors and bounded response times on very large data

2013

... That is, Q(D_Q) includes only “relevant” answers, and “covers” all exact answers. It finds sensible answers in users’ interest, and suffices for exploratory queries, e.g., real-time problem diagnosis on logs^[15]. ...

... The objective is ambitious. As observed in^[16], approximate query answering is challenging. Previous approaches often adopt an one-size-fit-all synopsis D_c and computes Q(D_c) for all queries Q posed on D. The approaches “substantially limit the types of queries they can execute”^[15], and often focus on aggregate queries (max, min, avg, sum, count). Moreover, they make various assumptions on future queries, i.e., workloads, query predicates or QCSs, i.e., “the frequency of columns used for grouping and filtering does not change over time”. Worse still, they provide either no accuracy guarantee at all, or probabilistic error rates for aggregate queries only. Such error rates do not tell us how “good” each approximate answer is. ...

Approximate query processing: No silver bullet

2017

Distributed graph simulation: Impossibility and possibility

2014

SYNC or ASYNC: Time to fuse for distributed graph-parallel computation

2015

Association rules with graph patterns

2015

Adding counting quantifiers to graph patterns

2016

Functional dependencies for graphs

2016

Making pattern queries bounded in big graphs

2015

Querying big graphs within bounded resources

2014

〈

〉

检索词推荐：

Is Big Data Analytics Beyond the Reach of Small Companies?

1 Introduction

2 Bounded Evaluation

3 Data Driven Approximation

4 A Resource Bounded Query Evaluation Framework

Author Information and Contributions

参考文献 文献选项 原文顺序 文献年度倒序 文中引用次数倒序 被引期刊影响因子

参考文献

原文顺序

文献年度倒序

文中引用次数倒序

被引期刊影响因子