Big Linked Data Management: Challenges, Solutions and Practices

Shen Zhihong1, Yao Chang2, Hou Yanfei1, Wu Linhuan3, Li Yuepeng1

1 (Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China)
2 (National Natural Science Foundation, Beijing 100085, China)
3 (Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China)

Abstract [Objective] This article analyzed the concept, connotation, and characteristics of big linked data and explored possible solutions to the technical challenges of managing it. [Methods] We proposed a new model based on NoSQL data management, distributed graph computing, and big data pipeline technologies, and on this basis designed and developed gETL, a large-scale graph data warehouse processing system. [Results] The proposed system was used in the NSFC-KBMS and WDCM projects, where it effectively managed large-scale knowledge data and biological data. [Limitations] The proposed system could be further improved through use in new applications. [Conclusions] NoSQL data storage, distributed graph computing, and big data pipeline technologies, together with the gETL system, help address the challenges of big linked data management.
Received: 12 December 2017
Published: 05 February 2018