Data Analysis and Knowledge Discovery, 2020, Vol. 4, Issue 7: 50-65. https://doi.org/10.11925/infotech.2096-3467.2020.0452
Research Paper
Research on a New Graph-Centric Big Data Technology Stack *
Big Data Technology Stack Shifting: From SQL Centric to Graph Centric
Shen Zhihong1, Zhao Zihao1,2, Wang Haibo1
1Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
2University of Chinese Academy of Sciences, Beijing 100049, China
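Abstract (translated from Chinese)

[Objective] The traditional SQL-centric technology stack cannot effectively cope with the challenges that big data scenarios bring, such as managing diverse heterogeneous data, managing large-scale relationship networks, and analyzing complex networks. This paper studies a new big data technology stack. [Methods] By analyzing the advantages of the graph data model, together with the current development and application of graph technologies, we propose a new graph-centric big data technology stack and introduce PandaDB, an intelligent fusion data management system. [Results] The technology stack has been validated in practical applications such as biological data networks and science and technology knowledge graphs, and PandaDB shows good capability for the fused management of structured and unstructured data. [Limitations] Large-scale adoption of the technology stack still faces difficulties such as insufficient supporting tools and an immature application ecosystem. [Conclusions] The new graph-centric big data technology stack will deliver greater value in more big data application scenarios.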

Keywords: Graph Model; Graph Database; Data Warehouse; Technology Stack
Abstract

[Objective] The traditional SQL-centric technology stack cannot handle the management of diverse, heterogeneous data, the management of large-scale relationship networks, or complex network analysis. We therefore propose a new graph-centric technology stack for big data. [Methods] We first analyzed the advantages of the graph data model and, building on the current state of graph technologies, established a new graph-centric technology stack. We then developed PandaDB, an intelligent fusion data management system. [Results] The new technology stack performed well in applications such as biological data networks and scholarly knowledge graphs. PandaDB can manage structured and unstructured data in a fused manner. [Limitations] Wider adoption of this technology stack is hindered by a lack of supporting tools and an immature application ecosystem. [Conclusions] The graph-centric technology stack will play a greater role in more big data applications.

Key words: Graph Model; Graph Database; Data Warehouse; Technology Stack
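Received: 2020-05-20      Published: 2020-07-25
CLC number (ZTFLH): TP393
Funding: *This work is one of the research outcomes of the National Key R&D Program of China, Cloud Computing and Big Data special project "Scientific Big Data Management System" (Grant No. 2016YFB1000605); the cooperation project between the Computer Network Information Center, Chinese Academy of Sciences and the National Natural Science Foundation of China, "Big Data Knowledge Management Service Platform of the National Natural Science Foundation" (Grant No. GC-FG4161781); and the China National Tobacco Corporation major science and technology project "Research on Key Technologies for Tobacco Research Data Fusion and Association Mining" (Grant No. 110201801019(SJ-01)).
Corresponding author: Shen Zhihong     E-mail: bluejoe@cnic.cn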
Cite this article:
Shen Zhihong, Zhao Zihao, Wang Haibo. Big Data Technology Stack Shifting: From SQL Centric to Graph Centric[J]. Data Analysis and Knowledge Discovery, 2020, 4(7): 50-65.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2020.0452      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I7/50
Fig.1  Apache Kylin provides a SQL interface for multidimensional statistics[15]
Fig.2  The SQL-centric technology stack
Fig.3  BigDAWG system architecture[27]
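Name | Vertices | Edges | Description
Wiki-Talk | 2 394 385 articles | 5 021 410 communication relationships | Wikipedia Talk network
Amazon0601 | 403 394 products | 3 387 388 "co-purchasing" relationships | Amazon product co-purchasing records
Flickr | 11 195 144 photos | 34 734 221 "likes" | Flickr photos and "like" records
USA Patents | 3 774 768 patents | 16 518 948 citation relationships | US patents (1975-1999) and their citation relationships
DBLP Data | 4 215 613 papers | 9 086 030 paper-author relationships | DBLP papers and their author relationships
musae-github | 37 700 experienced developers | 289 003 "follow" relationships | GitHub developer relationship network
roadNet-CA | 1 965 206 intersections | 2 766 607 roads | California road network
Table 1  Example data scales of relationship network datasets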
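Source model | Mapping to vertices | Mapping to edges
Relational model | Each row of a table maps to a vertex; each column maps to a vertex property | Primary/foreign-key associations map to edges
KV model | Each KV pair maps to a vertex with a single property |
Columnar model | Each row of a table maps to a vertex; each column maps to a vertex property |
Document model | Each document maps to a vertex; the document's fields map to vertex properties | Nesting relationships between documents map to edges
Table 2  Expressive power of the graph data model with respect to other models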
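The relational row of Table 2 can be illustrated with a minimal Python sketch. It is only an illustration of the mapping rule (row to vertex, column to property, foreign key to edge); the `Vertex` and `Edge` classes, the `rows_to_graph` function, and the sample tables are hypothetical names introduced here and are not taken from the paper or from PandaDB.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    vid: str                      # unique vertex id, built here as "<table>:<primary key>"
    label: str                    # the table name is used as the vertex label
    props: dict = field(default_factory=dict)


@dataclass
class Edge:
    src: str
    dst: str
    label: str


def rows_to_graph(tables, primary_keys, foreign_keys):
    """Map relational rows to vertices and foreign-key references to edges.

    tables:       {table_name: [row_dict, ...]}
    primary_keys: {table_name: pk_column}
    foreign_keys: [(table, fk_column, referenced_table), ...]
    """
    vertices, edges = {}, []
    # Each row becomes one vertex; each column becomes a vertex property.
    for table, rows in tables.items():
        pk = primary_keys[table]
        for row in rows:
            vid = f"{table}:{row[pk]}"
            vertices[vid] = Vertex(vid, table, dict(row))
    # Each non-null foreign-key value becomes one edge.
    for table, fk_col, ref_table in foreign_keys:
        pk = primary_keys[table]
        for row in tables[table]:
            if row.get(fk_col) is not None:
                edges.append(Edge(f"{table}:{row[pk]}",
                                  f"{ref_table}:{row[fk_col]}",
                                  fk_col))
    return vertices, edges


# Hypothetical sample data: a paper row references an author row.
tables = {
    "author": [{"id": 1, "name": "Codd"}],
    "paper": [{"id": 10, "title": "Relational Model", "author_id": 1}],
}
vertices, edges = rows_to_graph(
    tables,
    primary_keys={"author": "id", "paper": "id"},
    foreign_keys=[("paper", "author_id", "author")],
)
```

In this sketch the foreign-key column is also kept as a vertex property; a real converter might drop it once the edge exists.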
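Fig.4  Graph database development trend[43]
Fig.5  Development history of graph computing frameworks[46]
Fig.6  Terrorist network[59]
Fig.7  The new graph-centric big data technology stack
Fig.8  The data lake management system Delta Lake[65]
Fig.9  Graph data middle platform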
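Tool/technology | SQL-centric technology stack | Graph-centric technology stack
Database | Relational databases; the query language is SQL; drivers include ODBC, JDBC, DAO, etc. | Graph databases; query languages include Cypher, SPARQL, Gremlin, etc.
Data lake | Centralized, mixed management of structured, semi-structured, and unstructured data, where structured data is mainly relational tables | "One graph" management: graph-based fused management of structured, semi-structured, and unstructured data
Data warehouse | Multidimensional data warehouse | Multidimensional data warehouse + graph data warehouse, strengthening capabilities such as relationship mining and community mining
ETL | ETL is mostly SQL-based | gETL: mainly targets graph data and includes tasks such as entity extraction, relation extraction, entity disambiguation, and link prediction
Big data middle platform | Data services are mainly SQL reports and database CRUD | Graph data middle platform: advocates managing data assets with the graph at the core; services are mainly network analysis and graph visualization
Table 3  Comparison between the SQL-centric and graph-centric technology stacks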
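The database row of Table 3 can be made concrete with a small, hedged Python sketch that answers the same two-hop relationship question in both styles: a self-join over a relational table (using the standard-library `sqlite3` module) and a traversal over adjacency lists. The `follows` data and the `two_hop` helper are hypothetical and serve only to contrast the two access patterns; they are not taken from the paper.

```python
import sqlite3

# Hypothetical "follows" relationships: (follower, followee).
follows = [("ann", "bob"), ("bob", "cid"), ("bob", "dee"), ("ann", "eve")]

# SQL-centric: relationships live in a table, so two hops need a self-join.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE follows (src TEXT, dst TEXT)")
conn.executemany("INSERT INTO follows VALUES (?, ?)", follows)
sql_result = sorted(
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT f2.dst FROM follows f1 "
        "JOIN follows f2 ON f1.dst = f2.src WHERE f1.src = ?",
        ("ann",),
    )
)
conn.close()

# Graph-centric: relationships are adjacency lists, so two hops are two lookups.
adjacency = {}
for src, dst in follows:
    adjacency.setdefault(src, set()).add(dst)


def two_hop(graph, start):
    """Return the vertices reachable from start in exactly two hops."""
    return sorted({far
                   for near in graph.get(start, ())
                   for far in graph.get(near, ())})


graph_result = two_hop(adjacency, "ann")
```

Each additional hop adds another join in the SQL version, whereas the traversal version only adds another neighbor lookup, which is one way to read the motivation behind the graph-centric column.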
Fig.10  Representing structured and unstructured data with a property graph
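Operator | Meaning | Example
:: | Computes the similarity between x and y | x::y=0.7
~: | Tests whether x and y are similar | x~:y=true
!: | Tests whether x and y are not similar | x!:y=false
<: | Tests whether x is contained in y | x<:y=true
>: | Tests whether x contains y | y>:x=true
Table 4  Semantic operators defined by CypherPlus for Package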
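The semantics listed in Table 4 can be sketched in plain Python. This is not CypherPlus and not PandaDB's implementation; the table does not say how the operators are evaluated. The sketch assumes that items are already represented as numeric feature vectors, that similarity is cosine similarity, that "similar" means reaching a hypothetical threshold, and that a containing item is modeled as a list of part vectors. All function names are illustrative.

```python
import math

SIMILARITY_THRESHOLD = 0.6  # hypothetical cut-off, not taken from the source


def similarity(x, y):
    """x :: y, sketched as the cosine similarity of two feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def is_similar(x, y):
    """x ~: y, true when the similarity reaches the assumed threshold."""
    return similarity(x, y) >= SIMILARITY_THRESHOLD


def is_not_similar(x, y):
    """x !: y, the negation of ~:."""
    return not is_similar(x, y)


def is_in(x, y_parts):
    """x <: y, where y is modeled as a collection of part vectors."""
    return any(is_similar(x, part) for part in y_parts)


def contains(x_parts, y):
    """x >: y, the converse of <:."""
    return is_in(y, x_parts)


# Hypothetical data: one face vector and a photo containing two face vectors.
face = [1.0, 0.0]
group_photo = [[0.0, 1.0], [0.9, 0.1]]
```

With these definitions, `is_in(face, group_photo)` and `contains(group_photo, face)` always agree, mirroring the paired examples `x<:y` and `y>:x` in the table.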
Fig.11  AIPM technical architecture
Fig.12  Execution process of a Cypher query
Fig.13  Overall architecture of PandaDB
[1] Codd E F. Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks[J]. ACM SIGMOD Record, 2009,38(1):17-36.
[2] Codd E F, Codd S B, Salley C T. Providing OLAP (On-Line Analytical Processing) to User-analysts. An IT Mandate[R]. White Paper. Arbor Software Corporation, 1993.
[3] The Kettle Open Source Data Integration Project[EB/OL]. [2020-04-02]. http://www.kettle.be/.
[4] Talend - A Cloud Data Integration Leader (modern ETL) [EB/OL]. [2020-04-02]. https://www.talend.com/.
[5] Enterprise Cloud Data Management | Informatica [EB/OL]. [2020-04-02]. https://www.informatica.com/.
[6] DataX[EB/OL]. [2020-04-02]. https://github.com/alibaba/DataX.
[7] Oracle GoldenGate [EB/OL]. [2020-04-02]. https://www.oracle.com/middleware/technologies/goldengate.html.
[8] Thusoo A, Sarma J S, Jain N, et al. Hive: A Warehousing Solution over a Map-reduce Framework[J]. Proceedings of the VLDB Endowment, 2009,2(2):1626-1629.
[9] Kornacker M, Behm A, Bittorf V, et al. Impala: A Modern, Open-Source SQL Engine for Hadoop[C] //Proceedings of the 7th Biennial Conference on Innovative Data Systems Research(CIDR’15). 2015.
[10] Akhtar S, Magham R. Pro Apache Phoenix: An SQL Driver for HBase[M]. Apress, 2016.
[11] SQL Interface for Solr Cloud[EB/OL]. [2020-04-02]. https://github.com/bluejoe2008/solr-sql.
[12] Begoli E, Camacho-Rodríguez J, Hyde J, et al. Apache Calcite: A Foundational Framework for Optimized Query Processing over Heterogeneous Data Sources[C] //Proceedings of the 2018 International Conference on Management of Data. 2018: 221-230.
[13] Arasu A, Babu S, Widom J. The CQL Continuous Query Language: Semantic Foundations and Query Execution[J]. The VLDB Journal, 2006,15(2):121-142.
[14] Cai L, Chen J J, Chen J, et al. FusionInsight LibrA: Huawei’s Enterprise Cloud Data Analytics Platform[J]. Proceedings of the VLDB Endowment, 2018,11(12):1822-1834.
[15] Apache Kylin | Analytical Data Warehouse for Big Data [EB/OL]. [2020-04-02]. http://kylin.apache.org/cn/.
[16] Fernandes S, Bernardino J. What is BigQuery?[C] //Proceedings of the 19th International Database Engineering & Applications Symposium. 2015: 202-203.
[17] BigQuery ML[EB/OL]. [2020-04-02]. https://cloud.google.com/bigquery-ml/docs/bigqueryml-intro.
[18] Wang Y, Yang Y, Zhu W G, et al. SQLFlow: A Bridge Between SQL and Machine Learning[OL]. arXiv Preprint, arXiv:2001.06846.
[19] Katal A, Wazid M, Goudar R H. Big Data: Issues, Challenges, Tools and Good Practices[C] //Proceedings of the 6th International Conference on Contemporary Computing (IC3). IEEE, 2013: 404-409.
[20] Gaag A, Kohn A, Lindemann U. Function-based Solution Retrieval and Semantic Search in Mechanical Engineering[C] //Proceedings of the 17th International Conference on Engineering Design (ICED 09). 2009: 147-158.
[21] Cattell R. Scalable SQL and NoSQL Data Stores[J]. ACM SIGMOD Record, 2010,39(4):12-27.
[22] Wang G X, Tang J F. The NoSQL Principles and Basic Application of Cassandra Model[C] //Proceedings of International Conference on Computer Science & Service System (CSSS). 2012: 1332-1335.
[23] Brewer E. CAP Twelve Years Later: How the “Rules” Have Changed[J]. Computer, 2012,45(2):23-29.
[24] Lu J H, Holubová I. Multi-model Data Management: What's New and What's Next?[C] //Proceedings of the 20th International Conference on Extending Database Technology(EDBT). 2017: 602-605.
[25] Kiran M, Murphy P, Monga I, et al. Lambda Architecture for Cost-effective Batch and Speed Big Data Processing[C] //Proceedings of 2015 IEEE International Conference on Big Data. IEEE, 2015: 2785-2792.
[26] Questioning the Lambda Architecture [EB/OL]. [2020-04-02]. http://radar.oreilly.com/2014/07/questioning-the-lambdaarchitecture.html.
[27] Duggan J, Elmore A J, Stonebraker M, et al. The BigDAWG Polystore System[J]. ACM SIGMOD Record, 2015,44(2):11-16.
[28] Kwak H, Lee C H, Park H S, et al. What is Twitter, a Social Network or a News Media?[C] //Proceedings of the 19th International Conference on World Wide Web. 2010: 591-600.
[29] Backstrom L, Boldi P, Rosa M, et al. Four Degrees of Separation[C] //Proceedings of the 4th Annual ACM Web Science Conference. 2012: 45-54.
[30] Qi Guilin, Gao Huan, Wu Tianxing. The Research Advances of Knowledge Graph[J]. Technology Intelligence Engineering, 2017,3(1):4-25. (in Chinese)
[31] SN SciGraph-A Linked Open Data Platform for the Scholarly Domain[EB/OL]. [2020-04-02]. https://www.springernature.com/gp/researchers/scigraph.
[32] Zhang F J, Liu X, Tang J, et al. OAG: Toward Linking Large-scale Heterogeneous Entity Graphs[C] //Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019: 2585-2595.
[33] Stanford Large Network Dataset Collection [EB/OL]. [2020-04-02]. https://snap.stanford.edu/data/.
[34] BigDND: Big Dynamic Network Data [EB/OL]. [2020-04-02]. http://projects.csail.mit.edu/dnd/.
[35] Tang J. AMiner: Toward Understanding Big Scholar Data[C] //Proceedings of the 9th ACM International Conference on Web Search and Data Mining. 2016: 467.
[36] Wang Tao, Jiang Qinghua, Peng Jiajie, et al. A Review of the Construction and Analysis of Gene Co-expression Network[J]. Intelligent Computer and Applications, 2014,4(6):47-50,53. (in Chinese)
[37] Li Jianhui, Shen Zhihong, Meng Xiaofeng. Scientific Big Data Management: Concepts, Technologies and System[J]. Journal of Computer Research and Development, 2017,54(2):235-247. (in Chinese)
[38] Liu Z Q, Chen C C, Yang X X, et al. Heterogeneous Graph Neural Networks for Malicious Account Detection[C] //Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 2018: 2077-2085.
[39] Jin J G, Tang L C, Sun L, et al. Enhancing Metro Network Resilience via Localized Integration with Bus Services[J]. Transportation Research Part E: Logistics and Transportation Review, 2014,63:17-30.
[40] Seriani S, Fernandez R. Planning Guidelines for Metro-bus Interchanges by Means of a Pedestrian Microsimulation Model[J]. Transportation Planning & Technology, 2015,38(5):569-583.
[41] Huang H X, Song J H, Lin X L, et al. TGraph: A Temporal Graph Data Management System[C] //Proceedings of the 25th ACM International Conference on Information and Knowledge Management. 2016: 2469-2472.
[42] Khurana U, Deshpande A. Efficient Snapshot Retrieval over Historical Graph Data[C] //Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE2013). 2013: 997-1008.
[43] DBMS Popularity Broken down by Database Model[EB/OL]. [2020-04-02]. https://db-engines.com/en/ranking_categories.
[44] Valiant L G. A Bridging Model for Parallel Computation[J]. Communications of the ACM, 1990,33(8):103-111.
[45] Malewicz G, Austern M H, Bik A J, et al. Pregel: A System for Large-scale Graph Processing[C] //Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. 2010: 135-146.
[46] Review of Graph Computing Frameworks [EB/OL]. [2020-04-02]. https://blog.csdn.net/wjlwangluo/article/details/66972393. (in Chinese)
[47] Yan S C, Xu D, Zhang B Y, et al. Graph Embedding and Extensions: A General Framework for Dimensionality Reduction[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006,29(1):40-51. doi: 10.1109/TPAMI.2007.12. pmid: 17108382.
[48] Scarselli F, Gori M, Tsoi A C, et al. The Graph Neural Network Model[J]. IEEE Transactions on Neural Networks, 2008,20(1):61-80. doi: 10.1109/TNN.2008.2005605. pmid: 19068426.
[49] Defferrard M, Bresson X, Vandergheynst P. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering[C] //Proceedings of the Advances in Neural Information Processing Systems (NIPS). 2016: 3844-3852.
[50] Veličković P, Cucurull G, Casanova A, et al. Graph Attention Networks[C] //Proceedings of the International Conference on Learning Representations (ICLR). 2018.
[51] Palmer S, Rock I. Rethinking Perceptual Organization: The Role of Uniform Connectedness[J]. Psychonomic Bulletin & Review, 1994,1(1):29-55. pmid: 24203413.
[52] Lohmann S, Link V, Marbach E, et al. WebVOWL: Web-based Visualization of Ontologies[C] //Proceedings of the 2014 International Conference on Knowledge Engineering and Knowledge Management. 2014: 154-158.
[53] Deligiannidis L, Kochut K J, Sheth A P. RDF Data Exploration and Visualization[C] //Proceedings of the ACM 1st Workshop on CyberInfrastructure: Information Management in eScience. 2007: 39-46.
[54] Heim P, Hellmann S, Lehmann J, et al. RelFinder: Revealing Relationships in RDF Knowledge Bases[C] //Proceedings of the 4th International Conference on Semantic and Digital Media Technologies. 2009: 182-187.
[55] Hu X, Ye X M, Yu B Q, et al. GeaBase: A High-performance Distributed Graph Database for Industry-scale Applications[J]. International Journal of High Performance Computing and Networking, 2019,15(1/2):12-21.
[56] Tian Lixia. Review on Knowledge Graphs[J]. Computer Engineering & Software, 2020,41(4):67-71. (in Chinese)
[57] Dong X, Ding Y, Wang H J, et al. Chem2Bio2RDF Dashboard: Ranking Semantic Associations in Systems Chemical Biology Space[C] //Proceedings of FWCS2010. 2010.
[58] Vidal M E, Raschid L, Márquez N, et al. BioNav: An Ontology-based Framework to Discover Semantic Links in the Cloud of Linked Data[C] //Proceedings of the 7th International Conference on the Semantic Web: Research and Applications. 2010: 441-445.
[59] Krebs V. Uncloaking Terrorist Networks[J]. First Monday, 2002,7(4). https://doi.org/10.5210/fm.v7i4.941.
[60] Zhuang C Y, Yuan N J, Song R H, et al. Understanding People Lifestyles: Construction of Urban Movement Knowledge Graph from GPS Trajectory[C] //Proceedings of the 26th International Joint Conference on Artificial Intelligence. AAAI Press, 2017: 3616-3623.
[61] Chen W, Zhang X, Wang T J, et al. Opinion-aware Knowledge Graph for Political Ideology Detection[C] //Proceedings of the 26th International Joint Conference on Artificial Intelligence. AAAI Press, 2017: 3647-3653.
[62] Christakis N A, Fowler J H. The Spread of Obesity in a Large Social Network over 32 Years[J]. New England Journal of Medicine, 2007,357(4):370-379. doi: 10.1056/NEJMsa066082. pmid: 17652652.
[63] Fowler J H, Christakis N A. Dynamic Spread of Happiness in a Large Social Network: Longitudinal Analysis over 20 Years in the Framingham Heart Study[J]. British Medical Journal, 2008,337:1-9.
[64] Khine P P, Wang Z S. Data Lake: A New Ideology in Big Data Era[C] //Proceedings of the 4th Annual International Conference on Wireless Communication and Sensor Network. 2018,17:03025.
[65] Reliable Data Lakes at Scale [EB/OL]. [2020-04-02]. https://delta.io/.
[66] Chen C, Yan X F, Zhu F D, et al. Graph OLAP: Towards Online Analytical Processing on Graphs[C] //Proceedings of the 8th IEEE International Conference on Data Mining. 2008: 103-112.
[67] Che Pinjue. Building Data Middle Platform, Enable Creative Innovation[J]. New Economy Weekly, 2018(10):22-24. (in Chinese)
[68] Wu L H, Sun Q L, Desmeth P, et al. World Data Centre for Microorganisms: An Information Infrastructure to Explore and Utilize Preserved Microbial Strains Worldwide[J]. Nucleic Acids Research, 2017,45(D1):D611-D618. doi: 10.1093/nar/gkw903. pmid: 28053166.
[69] Shen Zhihong, Yao Chang, Hou Yanfei, et al. Big Linked Data Management: Challenges, Solutions and Practices[J]. Data Analysis and Knowledge Discovery, 2018,2(1):9-20. (in Chinese)