Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (1): 9-20    DOI: 10.11925/infotech.2096-3467.2017.1341
Research Paper
Big Linked Data Management: Challenges, Solutions and Practices*
Zhihong Shen1, Chang Yao2, Yanfei Hou1, Linhuan Wu3, Yuepeng Li1
1(Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China)
2(National Natural Science Foundation of China, Beijing 100085, China)
3(Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China)
Abstract

[Objective] This paper analyzes the concept, connotation, and characteristics of big linked data and, in view of the technical challenges of managing it, explores countermeasures and solutions. [Methods] We outline an approach to these challenges that combines NoSQL data management, distributed graph computing, and big data pipeline technologies, and on that basis develop gETL, a processing system for large-scale graph data warehouses. [Results] The approach and system have been applied in the NSFC-KBMS and WDCM projects, where they manage large-scale knowledge data and biological data effectively and meet diverse data management needs. [Limitations] The approach and system need further refinement in light of practical applications. [Conclusions] NoSQL data storage, distributed graph computing, and big data pipeline technologies, together with the gETL system, can address the challenges of managing big linked data well.

Key words: Linked Data, Knowledge Graph, Big Data, Big Linked Data
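Received: 2017-12-12
Funding: *This work is one of the research outcomes of the project "Scientific Big Data Management System" (Grant No. 2016YFB1000605) under the Cloud Computing and Big Data special program of the National Key R&D Program of China, and of the project "NSFC Big Data Knowledge Management Service Platform" (Grant No. GC-FG4161781), a collaboration between the Computer Network Information Center, Chinese Academy of Sciences, and the National Natural Science Foundation of China.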
Cite this article:
Zhihong Shen, Chang Yao, Yanfei Hou, Linhuan Wu, Yuepeng Li. Big Linked Data Management: Challenges, Solutions and Practices[J]. Data Analysis and Knowledge Discovery, 2018, 2(1): 9-20. DOI: 10.11925/infotech.2096-3467.2017.1341.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2017.1341
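Fig. 1  Conceptual hierarchy of big linked data
Fig. 2  Typical workflow of big linked data management
Fig. 3  Task framework of big linked data management
Fig. 4  RDF management engine based on HBase storage [43]
Fig. 5  Iterative computation on a distributed graph [46]
Fig. 6  gETL, a processing system for large-scale graph data warehouses
Fig. 7  RDF storage based on the TITAN graph engine
Fig. 8  Abstract model of the big data pipeline
Fig. 9  Execution process of the big data pipeline
Fig. 10  Overview of the NSFC-KBMS pipeline
Fig. 11  NSFC-KBMS big data network services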
[1] Berners-Lee T. Design Issues: Linked Data[EB/OL]. [2017-12-29].
[2] Shen Zhihong, Zhang Xiaolin. Linked Data and Its Applications: An Overview[J]. New Technology of Library and Information Service, 2011(11): 1-9.
[3] Big Data[J]. Nature, 2008, 455(7209): 1-136.
doi: 10.1038/455001a
[4] Big Data[EB/OL]. [2017-12-29].
[5] Li Jianhui, Shen Zhihong, Meng Xiaofeng. Scientific Big Data Management: Concepts, Technologies and System[J]. Journal of Computer Research and Development, 2017, 54(2): 235-247.
[6] Hu B, Carvalho N, Laera L, et al. Towards Big Linked Data: A Large-scale, Distributed Semantic Data Storage[C]// Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services, Bali, Indonesia. New York, USA: ACM, 2012: 167-176.
[7] Hitzler P, Janowicz K. Linked Data, Big Data, and the 4th Paradigm[J]. Semantic Web, 2013, 4(3): 233-235.
[8] Big Data & Linked Data[EB/OL]. [2017-06-08].
[9] Robak S, Franczyk B, Robak M. Applying Big Data and Linked Data Concepts in Supply Chains Management[C]// Proceedings of the 2013 Federated Conference on Computer Science and Information Systems (FedCSIS). IEEE, 2013: 1215-1221.
[10] Liu Wei, Xia Cuijuan, Zhang Chunjing. Big Data and Linked Data: The Emerging Data Technology for the Future of Librarianship[J]. New Technology of Library and Information Service, 2013(4): 2-9.
[11] Erling O, Mikhailov I. Virtuoso: RDF Support in a Native RDBMS[A]// Semantic Web Information Management[M]. Springer, Berlin, Heidelberg, 2010: 501-519.
[12] Bizer C, Cyganiak R. D2R Server - Publishing Relational Databases on the Semantic Web[C]// Proceedings of the 5th International Semantic Web Conference. 2006.
[13] Volz J, Bizer C, Gaedke M, et al. Silk - A Link Discovery Framework for the Web of Data[C]// Proceedings of the 2nd Workshop about Linked Data on the Web. 2009.
[14] Li Juanzi, Hou Lei. Overview of Knowledge Graph[J]. Journal of Shanxi University: Natural Science Edition, 2017, 40(3): 454-459.
[15] Auer S, Bizer C, Kobilarov G, et al. DBpedia: A Nucleus for a Web of Open Data[A]// The Semantic Web[M]. Springer, Berlin, Heidelberg, 2007.
[16] Suchanek F M, Kasneci G, Weikum G. YAGO: A Large Ontology from Wikipedia and WordNet[J]. Web Semantics: Science, Services and Agents on the World Wide Web, 2008, 6(3): 203-217.
doi: 10.1016/j.websem.2008.06.001
[17] Vrandečić D, Krötzsch M. Wikidata: A Free Collaborative Knowledgebase[J]. Communications of the ACM, 2014, 57(10): 78-85.
doi: 10.1145/2629489
[18] Application of Knowledge Graph[EB/OL]. [2017-10-02].
[19] Barwick H. The ‘Four Vs’ of Big Data. Implementing Information Infrastructure Symposium[EB/OL]. [2012-10-02].
[20] IBM. What is Big Data?[EB/OL]. [2012-10-02].
[21] Cyganiak R, Jentzsch A, Abele A, McCrae J. Linking Open Data Cloud Diagram[EB/OL]. [2016-12-02].
[22] Wu L, Sun Q, Desmeth P, et al. World Data Centre for Microorganisms: An Information Infrastructure to Explore and Utilize Preserved Microbial Strains Worldwide[J]. Nucleic Acids Research, 2017, 45(D1): D611-D618.
doi: 10.1093/nar/gkw903 pmid: 5210620
[23] Auer S, Demter J, Martin M, et al. LODStats - An Extensible Framework for High-performance Dataset Analytics[A]// Knowledge Engineering and Knowledge Management[M]. Springer, Berlin, Heidelberg, 2012: 353-362.
[24] Dong X, Ding Y, Wang H, et al. Chem2Bio2RDF Dashboard: Ranking Semantic Associations in Systems Chemical Biology Space[C]// Proceedings of the 19th World Wide Web Conference on the Future of the Web in Collaborative Science (FWCS), Raleigh, NC, USA. 2010.
[25] Vidal M E, Raschid L, Márquez N, et al. BioNav: An Ontology-Based Framework to Discover Semantic Links in the Cloud of Linked Data[A]// The Semantic Web: Research and Applications[M]. Springer, Berlin, Heidelberg, 2010.
[26] Hausenblas M. Linked Data Applications[R/OL]. Digital Enterprise Research Institute (DERI), 2009.
[27] Xia Cuijuan, Liu Wei. Technologies and Implementation of Consuming Linked Data[J]. Journal of Academic Libraries, 2013, 31(3): 29-37.
doi: 10.3969/j.issn.1002-1027.2013.03.004
[28] Slater T, Bouton C, Huang E S. Beyond Data Integration[J]. Drug Discovery Today, 2008, 13(13-14): 584-589.
doi: 10.1016/j.drudis.2008.01.008
[29] He Shaopeng, Li Jianhui, Shen Zhihong, et al. Overview of the Storage Technology for Large-scale RDF Data[J]. Microcomputer Applications, 2013, 2(1): 8-16.
doi: 10.3969/j.issn.2095-347X.2013.01.002
[30] From Semantic Web to Knowledge Graph: Review of the Engineering of Semantic Technology[EB/OL]. [2016-12-02].
[31] Shen Zhihong, Li Jianhui, Zhang Xiaolin. Insights into Link Discovery Process for Linked Open Data: Positioning, Goals and Complexity[J]. Journal of Library Science in China, 2013, 39(6): 101-108.
doi: 10.3969/j.issn.1001-8867.2013.06.009
[32] Hassanzadeh O, Lim L, Kementsietsidis A, et al. A Declarative Framework for Semantic Link Discovery over Relational Data[C]// Proceedings of the 18th World Wide Web Conference (WWW2009). 2009: 1101-1102.
[33] Ngomo A C N, Auer S. LIMES: A Time-efficient Approach for Large-scale Link Discovery on the Web of Data[C]// Proceedings of the 22nd International Joint Conference on Artificial Intelligence. 2011: 2312-2317.
[34] Hassanzadeh O. Publishing Relational Databases as Linked Data[EB/OL]. [2016-12-02].
[35] Scharffe F, Liu Y, Zhou C. RDF-AI: An Architecture for RDF Datasets Matching, Fusion and Interlink[C]// Proceedings of the IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR). 2009.
[36] Cattell R. Scalable SQL and NoSQL Data Stores[J]. ACM SIGMOD Record, 2010, 39(4): 12-27.
[37] Wang G, Tang J. The NoSQL Principles and Basic Application of Cassandra Model[C]// Proceedings of the 2012 International Conference on Computer Science & Service System (CSSS). 2012: 1332-1335.
[38] Brewer E. CAP Twelve Years Later: How the "Rules" Have Changed[J]. Computer, 2012, 45(2): 23-29.
doi: 10.1109/MC.2012.37
[39] Webber J. A Programmatic Introduction to Neo4j[C]// Proceedings of the 3rd Annual Conference on Systems, Programming, and Applications: Software for Humanity. ACM, 2012: 217-218.
[40] Jouili S, Vansteenberghe V. An Empirical Comparison of Graph Databases[C]// Proceedings of the 2013 International Conference on Social Computing (SocialCom). IEEE, 2013: 708-715.
[41] Abreu D D, Flores A, Palma G, et al. Choosing Between Graph Databases and RDF Engines for Consuming and Mining Linked Data[C]// Proceedings of the 4th International Conference on Consuming Linked Data. 2013.
[42] Hernández D, Hogan A, Riveros C, et al. Querying Wikidata: Comparing SPARQL, Relational and Graph Databases[C]// Proceedings of the 15th International Semantic Web Conference. Springer International Publishing, 2016.
[43] Papailiou N, Konstantinou I, Tsoumakos D, et al. H2RDF: Adaptive Query Processing on RDF Data in the Cloud[C]// Proceedings of the 21st International Conference on World Wide Web. ACM, 2012: 397-400.
[44] Low Y, Gonzalez J, Kyrola A, et al. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud[J]. Proceedings of the VLDB Endowment, 2012, 5(8): 716-727.
doi: 10.14778/2212351
[45] Avery C. Giraph: Large-scale Graph Processing Infrastructure on Hadoop[C]// Proceedings of the Hadoop Summit. 2011.
[46] Xin R S, Gonzalez J E, Franklin M J, et al. GraphX: A Resilient Distributed Graph System on Spark[C]// Proceedings of the 1st International Workshop on Graph Data Management Experiences and Systems. ACM, 2013: 2.
[47] Koitzsch K. Data Pipelines and How to Construct Them[A]// Pro Hadoop Data Analytics[M]. Apress, 2017: 77-90.
[48] Yi X, Liu F, Liu J, et al. Building a Network Highway for Big Data: Architecture and Challenges[J]. IEEE Network, 2014, 28(4): 5-13.
doi: 10.1109/MNET.2014.6863125
[49] Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: Machine Learning in Python[J]. Journal of Machine Learning Research, 2011, 12: 2825-2830.
[50] Meng X R, Bradley J, Yavuz B, et al. MLlib: Machine Learning in Apache Spark[J]. Journal of Machine Learning Research, 2016, 17(1): 1235-1241.
[51] Apache NiFi. An Easy to Use, Powerful, and Reliable System to Process and Distribute Data[EB/OL]. [2016-12-02].
[52] Thusoo A, Sarma J S, Jain N, et al. Hive - A Petabyte Scale Data Warehouse Using Hadoop[C]// Proceedings of the 26th International Conference on Data Engineering (ICDE). IEEE, 2010: 996-1005.
[53] Avram A. Gremlin, A Language for Working with Graphs[EB/OL]. [2016-12-02].
[54] Wang C, Rayan I A, Schwan K. Faster, Larger, Easier: Reining Real-time Big Data Processing in Cloud[C]// Proceedings of the Posters and Demo Track. ACM, 2012.
[55] Ranawade S V, Navale S, Dhamal A, et al. Online Analytical Processing on Hadoop Using Apache Kylin[EB/OL]. [2016-12-02].
[56] Li L, Shen Z H, Li J H, et al. A Resilient Index Graph for Querying Large Biological Scientific Data[C]// Proceedings of the 2017 IEEE International Congress on Big Data (BigData Congress). 2017: 435-443.
[57] Carbone P, Katsifodimos A, Ewen S, et al. Apache Flink: Stream and Batch Processing in a Single Engine[J]. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2015, 36(4): 28-38.
[58] Jones M. Process Real-time Big Data with Twitter Storm[EB/OL]. [2016-12-02].
[59] Apache Beam: An Advanced Unified Programming Model[EB/OL]. [2016-12-02].