Big Data Technology Stack Shifting: From SQL Centric to Graph Centric
Shen Zhihong1, Zhao Zihao1,2, Wang Haibo1
1 Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
Abstract [Objective] The traditional SQL-centric technology stack cannot adequately handle the management of diverse, heterogeneous data, the management of large-scale networks, or complex network analysis. We therefore propose a new graph-centric technology stack for big data. [Methods] We first analyzed the advantages of the graph-based data model and established a graph-centric technology stack on that basis. We then developed PandaDB, an intelligent fusion data management system. [Results] The new technology stack performed well in applications involving a biological data network and a scholarly knowledge graph, and PandaDB was able to manage structured and unstructured data together. [Limitations] The lack of supporting tools and of a complete application ecosystem makes it difficult to promote this technology stack further. [Conclusions] The graph-centric technology stack is expected to play a greater role in big data applications.
Received: 20 May 2020
Published: 25 July 2020
Corresponding Author:
Shen Zhihong
E-mail: bluejoe@cnic.cn