Big Linked Data Management: Challenges, Solutions and Practices
Shen Zhihong1, Yao Chang2, Hou Yanfei1, Wu Linhuan3, Li Yuepeng1
1(Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China) 2(National Natural Science Foundation, Beijing 100085, China) 3(Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China)
[Objective] This article analyzed the concept, connotation and characteristics of big linked data, and explored possible solutions to the technical challenges of managing it. [Methods] We proposed a new model based on NoSQL data management, distributed graph computing and big data pipeline technologies, and designed and developed gETL, a large-scale graph data warehouse processing system. [Results] The proposed system was used in the NSFC-KBMS and WDCM projects, where it effectively managed large-scale knowledge data and biological data. [Limitations] The proposed system could be further improved through new applications. [Conclusions] NoSQL data storage, distributed graph computing and big data pipeline technologies, together with the gETL system, help address the challenges of big linked data management.