Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (7): 50-65    DOI: 10.11925/infotech.2096-3467.2020.0452
Big Data Technology Stack Shifting: From SQL Centric to Graph Centric
Shen Zhihong 1, Zhao Zihao 1,2, Wang Haibo 1
1 Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
Abstract  

[Objective] The traditional SQL-centric technology stack cannot adequately handle the management of diverse, heterogeneous data, the management of large-scale networks, or the analysis of complex networks. We therefore propose a new graph-centric technology stack for big data. [Methods] First, we analyzed the advantages of the graph-based data model and established a new graph-centric technology stack. Then, we developed PandaDB, an intelligent fusion data management system. [Results] The new technology stack performed well in applications involving biological data networks and scholarly knowledge graphs. PandaDB can manage structured and unstructured data in a fused manner. [Limitations] The lack of supporting tools and of a complete application ecosystem makes it difficult to promote this technology stack further. [Conclusions] The new technology stack is expected to play a greater role in big data applications.

Key words: Graph Model; Graph Database; Data Warehouse; Technology Stack
Received: 20 May 2020      Published: 25 July 2020
CLC Number: TP393
Corresponding Authors: Shen Zhihong     E-mail: bluejoe@cnic.cn

Cite this article:

Shen Zhihong, Zhao Zihao, Wang Haibo. Big Data Technology Stack Shifting: From SQL Centric to Graph Centric. Data Analysis and Knowledge Discovery, 2020, 4(7): 50-65.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.0452     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I7/50

Multidimensional Analysis Interface of SQL in Apache Kylin
SQL Centric Technology Stack
Architecture of BigDAWG
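Name | Vertices | Edges | Description
Wiki-Talk | 2,394,385 articles | 5,021,410 communication relationships | Wikipedia Talk network
Amazon0601 | 403,394 products | 3,387,388 co-purchasing links | Amazon product co-purchasing records
Flickr | 11,195,144 photos | 34,734,221 "likes" | Flickr photos and "like" records
USA Patents | 3,774,768 patents | 16,518,948 citation relationships | US patents (1975-1999) and their citations
DBLP Data | 4,215,613 papers | 9,086,030 paper-author relationships | DBLP papers and author relationships
musae-github | 37,700 active developers | 289,003 "follow" links | GitHub developer relationship network
roadNet-CA | 1,965,206 intersections | 2,766,607 roads | California road network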
Size of Network Datasets
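Source model | Mapping to vertices | Mapping to edges
Relational model | Each table row maps to a vertex; each column maps to a vertex property | Primary key/foreign key associations map to edges
KV model | Each key-value pair maps to a vertex with a single property |
Columnar model | Each table row maps to a vertex; each column maps to a vertex property |
Document model | Each document maps to a vertex; document fields map to vertex properties | Nesting relationships between documents map to edges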
Representation of Graph Model for Other Models
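The relational and KV rows of the mapping table above can be illustrated with a minimal Python sketch. It is not code from the paper or from PandaDB: the `Vertex` and `Edge` classes, the `rows_to_graph` and `kv_to_vertex` helpers, the `table:key` vertex ID scheme, and the sample `author`/`paper` tables are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    vid: str                      # hypothetical ID scheme: "<table>:<primary key>"
    label: str                    # source table (or model) name
    props: dict = field(default_factory=dict)


@dataclass
class Edge:
    src: str
    dst: str
    label: str


def rows_to_graph(tables, foreign_keys):
    """Map relational rows to vertices and foreign-key links to edges.

    tables:       {table_name: (primary_key_column, [row dicts])}
    foreign_keys: [(table, fk_column, referenced_table)]
    """
    vertices, edges = {}, []
    # Each row becomes a vertex; each column becomes a vertex property.
    for name, (pk, rows) in tables.items():
        for row in rows:
            vid = f"{name}:{row[pk]}"
            vertices[vid] = Vertex(vid, name, dict(row))
    # Each non-null foreign-key value becomes an edge to the referenced row.
    for table, column, ref_table in foreign_keys:
        pk, rows = tables[table]
        for row in rows:
            if row.get(column) is not None:
                edges.append(
                    Edge(f"{table}:{row[pk]}", f"{ref_table}:{row[column]}", column)
                )
    return vertices, edges


def kv_to_vertex(key, value):
    """Map one key-value pair to a vertex with a single property."""
    return Vertex(f"kv:{key}", "kv", {key: value})


# Made-up sample data, for illustration only.
tables = {
    "author": ("id", [{"id": 1, "name": "A"}]),
    "paper": ("id", [{"id": 10, "title": "T", "author_id": 1}]),
}
vertices, edges = rows_to_graph(tables, [("paper", "author_id", "author")])
```

In this sketch the foreign-key column stays on the vertex as a property as well as becoming an edge; a real migration tool might drop it once the edge exists.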
Trend of Graph Database Development
History of Graph Computing Framework
Terrorist Relationship Network
Graph Centric Big Data Technology Stack
Data Lake Management System Delta Lake
Graph Data Middle Platform
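Tool/technology | SQL-centric technology stack | Graph-centric technology stack
Database | Relational database; the query language is SQL; drivers include ODBC, JDBC, DAO, etc. | Graph database; query languages include Cypher, SPARQL, Gremlin, etc.
Data lake | Centralized, mixed management of structured, semi-structured, and unstructured data, with structured data held mainly in relational tables | "One graph" management: graph-based fused management of structured, semi-structured, and unstructured data
Data warehouse | Multidimensional data warehouse | Multidimensional data warehouse plus graph data warehouse, strengthening capabilities such as relationship mining and community mining
ETL | ETL is mostly SQL-based | gETL: centered on graph data, covering tasks such as entity extraction, relation extraction, entity disambiguation, and link prediction
Big data middle platform | Data services are mainly SQL reports and database CRUD | Graph data middle platform: advocates managing data assets with the graph at the core; services are mainly network analysis and knowledge graph visualization
Comparison of SQL Centric and Graph Centric Technology Stack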
Structured and Unstructured Data in Property Graph Model
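Operator | Meaning | Example
:: | Computes the similarity between x and y | x::y=0.7
~: | Tests whether x and y are similar | x~:y=true
!: | Tests whether x and y are not similar | x!:y=false
<: | Tests whether x is contained in y | x<:y=true
>: | Tests whether x contains y | y>:x=true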
Semantic Operation Symbols in CypherPlus for Package
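The intended semantics of the five operators above can be mimicked in plain Python. This is only a sketch under stated assumptions and not CypherPlus or PandaDB code: the table does not say how similarity is computed, so cosine similarity over feature vectors, the 0.5 threshold, the reading of "x is in y" as "some element of y is similar to x", and all function names are hypothetical.

```python
import math

SIMILARITY_THRESHOLD = 0.5  # assumption: the cutoff is not given in the table


def similarity(x, y):
    """Stand-in for `x :: y`: cosine similarity of two feature vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def is_similar(x, y, threshold=SIMILARITY_THRESHOLD):
    """Stand-in for `x ~: y`."""
    return similarity(x, y) >= threshold


def is_not_similar(x, y, threshold=SIMILARITY_THRESHOLD):
    """Stand-in for `x !: y`."""
    return not is_similar(x, y, threshold)


def is_in(x, y_parts, threshold=SIMILARITY_THRESHOLD):
    """Stand-in for `x <: y`: y is modeled as a list of feature vectors (assumption)."""
    return any(is_similar(x, part, threshold) for part in y_parts)


def contains(y_parts, x, threshold=SIMILARITY_THRESHOLD):
    """Stand-in for `y >: x`: the mirror image of `x <: y`."""
    return is_in(x, y_parts, threshold)


# Made-up vectors: `face` plays the role of x, `photo` the role of y.
face = [1.0, 0.0]
photo = [[0.0, 1.0], [1.0, 0.0]]
```

Defining `contains` as the mirror of `is_in` reflects the last two table rows, where `x<:y=true` and `y>:x=true` describe the same fact from both sides.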
Architecture of AIPM
Process of Cypher Query Execution
Architecture of PandaDB