Big data analytics is often prohibitively costly. It is typically conducted by parallel processing with a cluster of machines, and is considered a privilege of big companies that can afford the resources. This position paper argues that big data analytics is accessible to small companies with constrained resources. As an evidence, we present BEAS, a framework for querying big relations with constrained resources, based on bounded evaluation and data-driven approximation.
Big data analytics is expensive. Textbooks tell us that a computation problem is
One might be tempted to think that parallel computation could solve the problem, by adding more processors when needed. However, small businesses often have constrained resources and cannot afford large-scale parallel computation.
Is big data analytics a privilege of big companies? Is it beyond the reach of small companies that can only afford constrained resources?
We argue that big data analytics is possible under constrained resources. As an example, we consider relational query answering. Given an SQL query
We propose BEAS, a new query evaluation paradigm to answer SQL queries under constrained resources. The idea is to make big data small,
As proof of concept, we have developed a prototype system and evaluated it with call-detailed-record (CDR) queries at Huawei. We find that bounded evaluation improves the performance of CDR queries by orders of magnitude, and is able to reduce big datasets from PB (1015) to GB (109) in many cases.
Below we give a brief introduction to bounded evaluation (Section 2), followed by data-driven approximation scheme (Section 3). Putting these together, we present BEAS, our resource-bounded query evaluation framework (Section 4).
Given a query
Under access schema
◦the time for identifying
Example 1: Consider a database schema
(a) person(pid, city), stating that pid lives in city,
(b) friend(pid, fid), saying that fid is a friend of pid, and
(c) poi(address, type, city, price), for the type, price and city of points of interest.
An example access schema
Consider a query
from friend as f, person as p
where f.pid =
However, we can do better since
The notion of access schema was proposed in, and the foundation of bounded evaluation was established in[5,7]. The theory of bounded evaluation was first evaluated using SPC queries and then extended to relational algebra(RA). A challenge is that it is undecidable to decide whether an SQL query is boundedly evaluable under an access schema
(a) an RA query
(b) it takes PTIME (polynomial time) in the size |
It has been shown that bounded evaluation can be readily built on top of commercial DBMS (database management systems) such as MySQL and PostgreSQL. It extends the DBMS with an immediate capability of querying big relations under constrained resources. A prototype system was developed in. We find that about 77% of SPC queries and 67% of SQL queries are boundedly evaluable. Better yet, more than 90% of the CDR queries at Huawei are boundedly evaluable.
We propose a data-driven scheme for approximate query answering. It is parameterized with a
Intuitively, it computes approximate answers
(b) for each exact answer
The objective is ambitious. As observed in, approximate query answering is challenging. Previous approaches often adopt an
Nonetheless, data-driven approximation is feasible under access schema.
Example 2: Continuing with Example 1, consider query
select h.address, h.price
from poi as h, friend as f, person as p
where f.pid =
We can compute
. . .
(price, address) pairs corresponding to (“hotel”,
It has been shown that for any dataset
As opposed to previous approximate query answering approaches, the data-driven approximation scheme is able to answer SQL queries
We find that the approximation scheme computes approximate answers to SQL queries, aggregate or not, with accuracy
We are now ready to present BEAS (Boundedly EvAluable Sql), a resource-bounded framework for querying big relations. For a big dataset
(1) it checks whether
(2) if so, it computes
(3) otherwise, BEAS identifies
That is, under the resource constraint
As opposed to conventional DBMS, BEAS is unique in its ability to (1) comply with
In light of these, BEAS is promising for providing small companies with the capability of big data analytics and hence, to benefit from big data services. It can also help big companies such as Huawei to reduce the cost and improve efficiency. Moreover, parallel processing is not a silver bullet for big data analytics. Indeed, one might expect a parallel algorithm to have the
The idea of resource-bounded query answering is not limited to relations. It has been shown that bounded evaluation improves the performance of graph pattern matching via subgraph isomorphism, an intractable problem that is widely used in social media marketing[19-20] and knowledge base expansion, by 4 orders of magnitude on average. For personalized social search via subgraph isomorphism, data-driven approximation retains 100% accuracy (
Yang Cao is a post-doctoral research fellow in LFCS, the School of Informatics, University of Edinburgh, UK. He received his Ph.D from the University of Edinburgh in 2016, supervised by Prof. Wenfei Fan. He received his B.S. from Beihang University. He is the recipient of Best Paper Award for SIGMOD 2017, a Microsoft Asia Research Fellowship, and earlier awards including National No.1 of CUMCM and first prize of MCM. His current research interests include querying big data and graph pattern matching.
Yang Cao (firstname.lastname@example.org, corresponding author) designed the research framework and carried out the system.
Professor Wenfei Fan is the Chair of Web Data Management in the School of Informatics, University of Edinburgh, UK, the director of Huawei-Edinburgh Joint Research Lab, the chief scientist of Beijing Advanced Innovation Center for Big Data and Brain Computing, and the director of the International Research Center on Big Data, Beihang University, China. He is a Fellow of the Royal Society of Edinburgh, UK, a Fellow of the ACM, a National Professor of the 1000-Talent Program and a Yangtze River Scholar, China. He received his Ph.D. from the University of Pennsylvania, and his MS and BS from Peking University. He is a recipient of the Alberto O. Mendelzon Test-of-Time Award of ACM PODS 2015 and 2010, ERC Advanced Fellowship in 2015, the Roger Needham Award in 2008 (UK), the Outstanding Overseas Young Scholar Award in 2003 (China), the Career Award in 2001 (USA), and several Best Paper Awards (SIGMOD 2017, VLDB 2010, ICDE 2007, and Computer Networks in 2002). His current research interests include database theory and systems, in particular big data, data quality, data integration, distributed query processing, query languages, recommender systems, social networks and Web services.
Wenfei Fan (email@example.com) proposed the research problems, designed the research framework and wrote the manuscript.
Tengfei Yuan is a first year Ph.D. student in LFCS, the School of Informatics, University of Edinburgh, supervised by Prof. Wenfei Fan. He received his B.Eng. from Shandong University. He is currently working on the development of BEAS, a system for bounded evaluation of SQL queries.
Tengfei Yuan (firstname.lastname@example.org) implemented system and carried out the tests.
A query Q has a bounded rewriting using a set of views if there exists a query Q' expressed in the same language as Q, such that given a dataset D, Q(D) can be computed by Q' that accesses only cached views and a small fraction DQ of D. We consider datasets D that satisfy a set of access constraints, a combination of cardinality constraints and associated indices, such that the size |DQ| of DQ and the time to identify DQ are independent of |D|, no matter how big D is. This paper studies the problem for deciding whether a query has a bounded rewriting given a set V of views and a set A of access constraints. We establish the complexity of the problem for various query languages, from Σ3p-complete for conjunctive queries (CQ), to undecidable for relational algebra (FO). We show that the intractability for CQ is rather robust even for acyclic CQ with fixed V and A, and characterize when the problem is in PTIME. To make practical use of bounded rewriting, we provide an effective syntax for FO queries that have a bounded rewriting. The syntax characterizes a core subclass of such queries without sacrificing the expressive power, and can be checked in PTIME.
This paper investigates the feasibility of querying big data by accessing a bounded amount of the data. We study boundedly evaluable queries under a form of access constraints, when their evaluation cost is determined by the queries and constraints only. While it is undecidable to determine whether FO queries are boundedly evaluable, we show that for several classes of FO queries, the bounded evaluability problem is decidable. We also provide characterization and effective syntax for their boundedly evaluable queries. When a query Q is not boundedly evaluable, we study two approaches to approximately answering Q under access constraints. (1) We search for upper and lower envelopes of Q that are boundedly evaluable and warrant a constant accuracy bound. (2) We instantiate a minimum set of variables (parameters) in Q such that the specialized query is boundedly evaluable. We study problems for deciding the existence of envelopes and bounded specialized queries, and establish their complexity for various classes of FO queries.
To make query answering feasible in big datasets, practitioners have been looking into the notion of scale independence of queries. Intuitively, such queries require only a relatively small subset of the data, whose size is determined by the query and access methods rather than the size of the dataset itself. This paper aims to formalize this notion and study its properties. We start by defining what it means to be scale-independent, and provide matching upper and lower bounds for checking scale independence, for queries in various languages, and for combined and data complexity. Since the complexity turns out to be rather high, and since scale-independent queries cannot be captured syntactically, we develop sufficient conditions for scale independence. We formulate them based on access schemas, which combine indexing and constraints together with bounds on the sizes of retrieved data sets. We then study two variations of scale-independent query answering, inspired by existing practical systems. One concerns incremental query answering: we check when query answers can be maintained in response to updates scale-independently. The other explores scale-independent query rewriting using views.
Abstract Facebook has introduced a new ways for people to communicate and, at the same time, to share stories and experiences. Therefore, its usage in the area of e-Government becomes every day more and more important. In this paper we will give an overview of an application for collecting data from e-Government Facebook pages of Republic of Ser-bia which we have analyzed. Found results suggest in which way using Facebook we can improve interaction between citizens and e-Government which leads to maximal potential and usage of these social network.
Abstract In this paper, we reflect on the state of the art of Approximate Query Processing. Although much technical progress has been made in this area of research, we are yet to see its impact on products and services. We discuss two promising avenues to pursue towards integrating Approximate Query Processing into data platforms.
This paper studies fundamental problems for distributed graph simulation. Given a pattern query Q and a graph G that is fragmented and distributed, a graph simulation algorithm A is to compute the matches Q(G) of Q in G. We say that A is parallel scalable in (a) response time if its parallel computational cost is determined by the largest fragment Fm of G and the size |Q| of query Q, and (b) data shipment if its total amount of data shipped is determined by |Q| and the number of fragments of G, independent of the size of graph G. (1) We prove an impossibility theorem: there exists no distributed graph simulation algorithm that is parallel scalable in either response time or data shipment. (2)However, we show that distributed graph simulation is partition bounded, i.e., its response time depends only on |Q|,|Fm| and the number |Vf | of nodes in G with edges across different fragments; and its data shipment depends on |Q| and the number |Ef | of crossing edges only. We provide the first algorithms with these performance guarantees. (3) We also identify special cases of patterns and graphs when parallel scalability is possible. (4) We experimentally verify the scalability and efficiency of our algorithms.
Abstract We propose graph-pattern association rules (GPARs) for so- cial media marketing. Extending association rules for item- sets, GPARs help us discover regularities between entities in social graphs, and identify potential customers by exploring social influence. We study the problem of discovering top- k diversified GPARs. While this problem is NP-hard, we develop a parallel algorithm with accuracy bound. We also study the problem of identifying potential customers with GPARs. While it is also NP-hard, we provide a parallel scal- able algorithm that guarantees a polynomial speedup over sequential algorithms with the increase of processors. Using real-life and synthetic graphs, we experimentally verify the scalability and effectiveness of the algorithms.
This paper proposes quantified graph patterns (QGPs), an extension of graph patterns by supporting simple counting quantifiers on edges. We show that QGPs naturally express universal and existential quantification, numeric and ratio aggregates, as well as negation. Better still, the increased expressivity does not come with a much higher price. We show that quantified matching, i.e., graph pattern matching with QGPs, remains NP-complete in the absence of negation, and is DP-complete for general QGPs. We show how quantified matching can be conducted by incorporating quantifier checking into conventional subgraph isomorphism methods. We also develop parallel scalable algorithms for quantified matching. As an application of QGPs, we introduce quantified graph association rules defined with QGPs, to identify potential customers in social media marketing. Using real-life and synthetic graphs, we experimentally verify the effectiveness of QGPs and the scalability of our algorithms.
It is cost-prohibitive to find matches Q(G) of a pattern query Q in a big graph G. We approach this by fetching a small subgraph GQ of G such that Q(GQ) = Q(G). We show that many practical patterns are effectively bounded under access constraints A commonly found in real life, such that GQ can be identified in time determined by Q and A only, independent of the size |G| of G. This holds no matter whether pattern queries are localized (e.g., via subgraph isomorphism) or non-localized (graph simulation). We provide algorithms to decide whether a pattern Q is effectively bounded, and if so, to generate a query plan that computes Q(G) by accessing GQ, in time independent of |G|. When Q is not effectively bounded, we give an algorithm to extend access constraints and make Q bounded in G. Using real-life data, we experimentally verify the effectiveness of the approach, e.g., about 60% of queries are effectively bounded for subgraph isomorphism, and for such queries our approach outperforms the conventional methods by 4 orders of magnitude.
This paper studies the problem of querying graphs within bounded resources. Given a query Q, a graph G and a small ratio α, it aims to answer Q in G by accessing only a fraction GQ of G of size |GQ| ≤ α |G|. The need for this is evident when G is big while our available resources are limited, as indicated by α. We propose resource-bounded query answering via a dynamic scheme that reduces big G to GQ. We investigate when we can find the exact answers Q(G) from GQ, and if GQ cannot accommodate enough information, how accurate the approximate answers Q(GQ) are. To verify the effectiveness of the approach, we study two types of queries. One consists of pattern queries that have data locality, such as subgraph isomorphism and strong simulation. The other is the class of reachability queries, without data locality. We show that it is hard to get resource-bounded algorithms with 100% accuracy: NP-hard for pattern queries, and non-existing for reachability when α ≠ 1. Despite these, we develop resource-bounded algorithms for answering these queries. Using real-life and synthetic data, we experimentally evaluate the performance of the algorithms. We find that they scale well for both types of queries, and our approximate answers are accurate, even 100% for small α.
|版权所有 © 2015 《数据分析与知识发现》编辑部