Big Data Technology Stack Shifting: From SQL Centric to Graph Centric
Shen Zhihong1, Zhao Zihao1,2, Wang Haibo1
1Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
2University of Chinese Academy of Sciences, Beijing 100049, China
[Objective] The traditional SQL-centric technology stack cannot adequately handle the management of diverse, heterogeneous data, the management of large-scale networks, or the analysis of complex networks. We therefore propose a new graph-centric technology stack for big data. [Methods] First, we analyzed the advantages of the graph data model and designed a new graph-centric technology stack. Then, we developed PandaDB, an intelligent fusion data management system. [Results] The new technology stack performed well in applications involving a biological data network and a scholarly knowledge graph. PandaDB can manage structured and unstructured data together. [Limitations] The technology stack is difficult to promote further because supporting tools and a complete application ecosystem are still lacking. [Conclusions] The new technology stack is expected to play a greater role in big data applications.
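The modelling difference that the abstract refers to can be illustrated with a minimal sketch. It answers one question ("who are this author's co-authors?") first in a relational style, by self-joining an authorship table, and then in a graph style, by walking from the author to papers and back to authors. The data, the function names, and the adjacency-list representation are all hypothetical and chosen only for illustration; this is not PandaDB's API or storage format.

```python
from collections import defaultdict

# Hypothetical toy data: (author, paper) authorship pairs.
AUTHORSHIP = [
    ("alice", "p1"), ("bob", "p1"),
    ("bob", "p2"), ("carol", "p2"),
    ("dave", "p3"),
]


def coauthors_by_join(rows, author):
    """Relational style: self-join the authorship table on the paper column."""
    return sorted({
        other
        for (name, paper_a) in rows
        for (other, paper_b) in rows
        if name == author and paper_a == paper_b and other != author
    })


def build_graph(rows):
    """Graph style: store each authorship pair as an undirected edge."""
    adjacency = defaultdict(set)
    for author, paper in rows:
        adjacency[author].add(paper)
        adjacency[paper].add(author)
    return adjacency


def coauthors_by_traversal(adjacency, author):
    """Walk author -> paper -> author, visiting only neighbouring nodes."""
    found = set()
    for paper in adjacency.get(author, ()):
        found |= adjacency.get(paper, set())
    found.discard(author)
    return sorted(found)


GRAPH = build_graph(AUTHORSHIP)
```

Both functions return the same answer. The join compares every pair of rows in the table, while the traversal only follows edges out of the starting node, which is the kind of advantage usually attributed to graph models for multi-hop relationship queries.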
Shen Zhihong, Zhao Zihao, Wang Haibo. Big Data Technology Stack Shifting: From SQL Centric to Graph Centric[J]. Data Analysis and Knowledge Discovery, 2020, 4(7): 50-65. (Published in Chinese; the Chinese title translates as "Research on a New Graph-Centric Big Data Technology Stack".)