Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (1): 9-20     https://doi.org/10.11925/infotech.2096-3467.2017.1341
  Research Paper
Big Linked Data Management: Challenges, Solutions and Practices*
Shen Zhihong1, Yao Chang2, Hou Yanfei1, Wu Linhuan3, Li Yuepeng1
1(Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China)
2(National Natural Science Foundation of China, Beijing 100085, China)
3(Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China)
Abstract

[Objective] This paper analyzes the concept, connotation, and characteristics of big linked data and, in view of the technical challenges of managing it, explores countermeasures and solutions. [Methods] We propose an approach to these challenges that combines NoSQL data management, distributed graph computing, and big data pipeline technologies, and on this basis develop gETL, a large-scale graph data warehouse processing system. [Results] The approach and system have been applied in the NSFC-KBMS and WDCM projects, where they manage large-scale knowledge data and biological data effectively and meet diverse data management needs. [Limitations] The approach and system need further refinement in light of real applications. [Conclusions] NoSQL data storage, distributed graph computing, and big data pipeline technologies, together with the gETL system, can address the management problems of big linked data well.

Key words: Linked Data; Knowledge Graph; Big Data; Big Linked Data
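Received: 2017-12-12      Published: 2018-02-05
Chinese Library Classification (ZTFLH):  TP393
Funding: *This work is one of the research outcomes of the National Key R&D Program of China, Cloud Computing and Big Data special project "Scientific Big Data Management System" (Grant No. 2016YFB1000605), and of the cooperative project between the Computer Network Information Center, Chinese Academy of Sciences, and the National Natural Science Foundation of China, "Big Data Knowledge Management Service Platform of the National Natural Science Foundation" (Grant No. GC-FG4161781).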
Cite this article:
Shen Zhihong, Yao Chang, Hou Yanfei, Wu Linhuan, Li Yuepeng. Big Linked Data Management: Challenges, Solutions and Practices[J]. Data Analysis and Knowledge Discovery, 2018, 2(1): 9-20.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2017.1341      or      http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2018/V2/I1/9
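  Conceptual hierarchy of big linked data
  Typical workflow of big linked data management
  Task framework of big linked data management
  RDF management engine based on HBase storage[43]
  Iterative computation on distributed graphs[46]
  gETL, a large-scale graph data warehouse processing system
  RDF storage based on the TITAN graph engine
  Abstract model of the big data pipeline
  Execution process of the big data pipeline
  Overview of the NSFC-KBMS pipeline
  NSFC-KBMS big data network services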
[1] Berners-Lee T. Design Issues: Linked Data[EB/OL]. [2017-12-29].
[2] Shen Zhihong, Zhang Xiaolin. Linked Data and Its Applications: An Overview[J]. New Technology of Library and Information Service, 2011(11): 1-9. (in Chinese)
[3] Big Data[J]. Nature, 2008, 455(7209): 1-136.
doi: 10.1038/455001a
[4] Big Data[EB/OL]. [2017-12-29].
[5] Li Jianhui, Shen Zhihong, Meng Xiaofeng. Scientific Big Data Management: Concepts, Technologies and System[J]. Journal of Computer Research and Development, 2017, 54(2): 235-247. (in Chinese)
[6] Hu B, Carvalho N, Laera L, et al. Towards Big Linked Data: A Large-scale, Distributed Semantic Data Storage[C]// Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services, Bali, Indonesia. New York, USA: ACM, 2012: 167-176.
[7] Hitzler P, Janowicz K. Linked Data, Big Data, and the 4th Paradigm[J]. Semantic Web, 2013, 4(3): 233-235.
[8] Big Data & Linked Data[EB/OL]. [2017-06-08].
[9] Robak S, Franczyk B, Robak M. Applying Big Data and Linked Data Concepts in Supply Chains Management[C]// Proceedings of the 2013 Federated Conference on Computer Science and Information Systems (FedCSIS). IEEE, 2013: 1215-1221.
[10] Liu Wei, Xia Cuijuan, Zhang Chunjing. Big Data and Linked Data: The Emerging Data Technology for the Future of Librarianship[J]. New Technology of Library and Information Service, 2013(4): 2-9. (in Chinese)
[11] Erling O, Mikhailov I. Virtuoso: RDF Support in a Native RDBMS[A]// Semantic Web Information Management[M]. Springer, Berlin, Heidelberg, 2010: 501-519.
[12] Bizer C, Cyganiak R. D2R Server - Publishing Relational Databases on the Semantic Web[C]// Proceedings of the 5th International Semantic Web Conference. 2006.
[13] Volz J, Bizer C, Gaedke M, et al. Silk - A Link Discovery Framework for the Web of Data[C]// Proceedings of the 2nd Workshop about Linked Data on the Web. 2009.
[14] Li Juanzi, Hou Lei. Overview of Knowledge Graph[J]. Journal of Shanxi University: Natural Science Edition, 2017, 40(3): 454-459. (in Chinese)
[15] Auer S, Bizer C, Kobilarov G, et al. DBpedia: A Nucleus for a Web of Open Data[A]// The Semantic Web[M]. Springer, Berlin, Heidelberg, 2007.
[16] Suchanek F M, Kasneci G, Weikum G. YAGO: A Large Ontology from Wikipedia and Wordnet[J]. Web Semantics: Science, Services and Agents on the World Wide Web, 2008, 6(3): 203-217.
doi: 10.1016/j.websem.2008.06.001
[17] Vrandečić D, Krötzsch M. Wikidata: A Free Collaborative Knowledgebase[J]. Communications of the ACM, 2014, 57(10): 78-85.
doi: 10.1145/2629489
[18] Application of Knowledge Graph[EB/OL]. [2017-10-02]. (in Chinese)
[19] Barwick H. The ‘Four Vs’ of Big Data. Implementing Information Infrastructure Symposium[EB/OL]. [2012-10-02].
[20] IBM. What is Big Data?[EB/OL]. [2012-10-02].
[21] Cyganiak R, Jentzsch A, Abele A, McCrae J. Linking Open Data Cloud Diagram[EB/OL]. [2016-12-02].
[22] Wu L, Sun Q, Desmeth P, et al. World Data Centre for Microorganisms: An Information Infrastructure to Explore and Utilize Preserved Microbial Strains Worldwide[J]. Nucleic Acids Research, 2017, 45(D1): D611-D618.
doi: 10.1093/nar/gkw903 pmid: 5210620
[23] Auer S, Demter J, Martin M, et al. Lodstats - An Extensible Framework for High-performance Dataset Analytics[A]// Knowledge Engineering and Knowledge Management[M]. Springer Berlin Heidelberg, 2012: 353-362.
[24] Dong X, Ding Y, Wang H, et al. Chem2Bio2RDF Dashboard: Ranking Semantic Associations in Systems Chemical Biology Space[C]// Proceedings of the 19th World Wide Web Conference on the Future of the Web in Collaborative Science (FWCS), Raleigh, NC, USA. 2010.
[25] Vidal M E, Raschid L, Márquez N, et al. BioNav: An Ontology-Based Framework to Discover Semantic Links in the Cloud of Linked Data[A]// The Semantic Web: Research and Applications[M]. Springer, Berlin, Heidelberg, 2010.
[26] Hausenblas M. Linked Data Applications[R/OL]. Digital Enterprise Research Institute (DERI), 2009.
[27] Xia Cuijuan, Liu Wei. Technologies and Implementation of Consuming Linked Data[J]. Journal of Academic Libraries, 2013, 31(3): 29-37. (in Chinese)
doi: 10.3969/j.issn.1002-1027.2013.03.004
[28] Slater T, Bouton C, Huang E S. Beyond Data Integration[J]. Drug Discovery Today, 2008, 13(13-14): 584-589.
doi: 10.1016/j.drudis.2008.01.008
[29] He Shaopeng, Li Jianhui, Shen Zhihong, et al. Overview of the Storage Technology for Large-scale RDF Data[J]. Microcomputer Applications, 2013, 2(1): 8-16. (in Chinese)
doi: 10.3969/j.issn.2095-347X.2013.01.002
[30] From Semantic Web to Knowledge Graph: Review of the Engineering of Semantic Technology[EB/OL]. [2016-12-02]. (in Chinese)
[31] Shen Zhihong, Li Jianhui, Zhang Xiaolin. Insights into Link Discovery Process for Linked Open Data: Positioning, Goals and Complexity[J]. Journal of Library Science in China, 2013, 39(6): 101-108. (in Chinese)
doi: 10.3969/j.issn.1001-8867.2013.06.009
[32] Hassanzadeh O, Lim L, Kementsietsidis, et al. A Declarative Framework for Semantic Link Discovery over Relational Data[C]// Proceedings of the 18th World Wide Web Conference (WWW2009). 2009: 1101-1102.
[33] Ngomo A C N, Auer S. LIMES: A Time-efficient Approach for Large-scale Link Discovery on the Web of Data[C]// Proceedings of the 22nd International Joint Conference on Artificial Intelligence. 2011: 2312-2317.
[34] Hassanzadeh O. Publishing Relational Databases as Linked Data[EB/OL]. [2016-12-02].
[35] Scharffe F, Liu Y, Zhou C. RDF-AI: An Architecture for RDF Datasets Matching, Fusion and Interlink[C]// Proceedings of the IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR). 2009.
[36] Cattell R. Scalable SQL and NoSQL Data Stores[J]. ACM SIGMOD Record, 2010, 39(4): 12-27.
[37] Wang G, Tang J. The NoSQL Principles and Basic Application of Cassandra Model[C]// Proceedings of the 2012 International Conference on Computer Science & Service System (CSSS). 2012: 1332-1335.
[38] Brewer E. CAP Twelve Years Later: How the "Rules" Have Changed[J]. Computer, 2012, 45(2): 23-29.
doi: 10.1109/MC.2012.37
[39] Webber J. A Programmatic Introduction to Neo4j[C]// Proceedings of the 3rd Annual Conference on Systems, Programming, and Applications: Software for Humanity. ACM, 2012: 217-218.
[40] Jouili S, Vansteenberghe V. An Empirical Comparison of Graph Databases[C]// Proceedings of the 2013 International Conference on Social Computing (SocialCom). IEEE, 2013: 708-715.
[41] Abreu D D, Flores A, Palma G, et al. Choosing Between Graph Databases and RDF Engines for Consuming and Mining Linked Data[C]// Proceedings of the 4th International Conference on Consuming Linked Data. 2013.
[42] Hernández D, Hogan A, Riveros C, et al. Querying Wikidata: Comparing SPARQL, Relational and Graph Databases[C]// Proceedings of the 15th International Semantic Web Conference. Springer International Publishing, 2016.
[43] Papailiou N, Konstantinou I, Tsoumakos D, et al. H2RDF: Adaptive Query Processing on RDF Data in the Cloud[C]// Proceedings of the 21st International Conference on World Wide Web. ACM, 2012: 397-400.
[44] Low Y, Gonzalez J, Kyrola A, et al. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud[J]. Proceedings of the VLDB Endowment, 2012, 5(8): 716-727.
doi: 10.14778/2212351
[45] Avery C. Giraph: Large-scale Graph Processing Infrastructure on Hadoop[C]// Proceedings of the Hadoop Summit. 2011.
[46] Xin R S, Gonzalez J E, Franklin M J, et al. GraphX: A Resilient Distributed Graph System on Spark[C]// Proceedings of the 1st International Workshop on Graph Data Management Experiences and Systems. ACM, 2013: 2.
[47] Koitzsch K. Data Pipelines and How to Construct Them[A]// Pro Hadoop Data Analytics[M]. Apress, 2017: 77-90.
[48] Yi X, Liu F, Liu J, et al. Building a Network Highway for Big Data: Architecture and Challenges[J]. IEEE Network, 2014, 28(4): 5-13.
doi: 10.1109/MNET.2014.6863125
[49] Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: Machine Learning in Python[J]. Journal of Machine Learning Research, 2011, 12: 2825-2830.
[50] Meng X R, Bradley J, Yavuz B, et al. MLlib: Machine Learning in Apache Spark[J]. Journal of Machine Learning Research, 2016, 17(1): 1235-1241.
[51] Apache NiFi. An Easy to Use, Powerful, and Reliable System to Process and Distribute Data[EB/OL]. [2016-12-02].
[52] Thusoo A, Sarma J S, Jain N, et al. Hive - A Petabyte Scale Data Warehouse Using Hadoop[C]// Proceedings of the 26th International Conference on Data Engineering (ICDE). IEEE, 2010: 996-1005.
[53] Avram A. Gremlin, A Language for Working with Graphs[EB/OL]. [2016-12-02].
[54] Wang C, Rayan I A, Schwan K. Faster, Larger, Easier: Reining Real-time Big Data Processing in Cloud[C]// Proceedings of the Posters and Demo Track. ACM, 2012.
[55] Ranawade S V, Navale S, Dhamal A, et al. Online Analytical Processing on Hadoop Using Apache Kylin[EB/OL]. [2016-12-02].
[56] Li L, Shen Z H, Li J H, et al. A Resilient Index Graph for Querying Large Biological Scientific Data[C]// Proceedings of the 2017 IEEE International Congress on Big Data (BigData Congress). 2017: 435-443.
[57] Carbone P, Katsifodimos A, Ewen S, et al. Apache Flink: Stream and Batch Processing in a Single Engine[J]. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2015, 36(4): 28-38.
[58] Jones M. Process Real-time Big Data with Twitter Storm[EB/OL]. [2016-12-02].
[59] Apache Beam: An Advanced Unified Programming Model[EB/OL]. [2016-12-02].