Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (2): 141-150    DOI: 10.11925/infotech.2096-3467.2022.0328
Constructing Large-scale Knowledge Graph for Massive Sci-Tech Literature
Du Yue (1, 2), Chang Zhijun (1, 2; corresponding author), Dong Mei (1, 2), Qian Li (1, 2), Wang Ying (1)
1. National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2. Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
Abstract  

[Objective] This paper builds a large-scale knowledge graph for scientific research that meets the needs of sci-tech information services and improves on the data consistency of traditional models. [Methods] We propose an implicit knowledge graph construction method. Identification tools for entity feature fields and for implicit relationships continuously update the entities and discover the relationships between them. [Results] We tested the model on a big data platform holding PB-level sci-tech literature. When entity data change, the implicit knowledge graph updates only the entity data and leaves the relationships untouched. Retrieving all scholars of one institution through the predefined interface took, on average, about one hundredth of the time required by the triple-based knowledge graph. [Limitations] Relationships that do not fit the implicit relational data structure are difficult to materialize, and the entity data must be stored in a cluster that provides a search engine. [Conclusions] The proposed method effectively mitigates the data consistency problems caused by changes in entity information. It supports the construction of large-scale scientific research knowledge graphs, which benefits the management, dissemination, and utilization of sci-tech knowledge.

Key words: Knowledge Graph; Data Consistency; Sci-Tech Big Data
Received: 11 April 2022      Published: 28 March 2023
ZTFLH:  TP391 G350  
Fund: Literature and Information Capacity Building Project of Chinese Academy of Sciences (Y9100901)
Corresponding Author: Chang Zhijun, ORCID: 0000-0001-9211-8599, E-mail: changzj@mail.las.ac.cn

Cite this article:

Du Yue, Chang Zhijun, Dong Mei, Qian Li, Wang Ying. Constructing Large-scale Knowledge Graph for Massive Sci-Tech Literature. Data Analysis and Knowledge Discovery, 2023, 7(2): 141-150.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.0328     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2023/V7/I2/141

Figure: The Construction Process of Implicit Knowledge Graph
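Table: Meta-definition of Entity Features
Sci-tech literature data | Feature field set
Scholar data | Name, affiliated institution, country
Institution data | Institution name, country, city
Journal data | Journal name, journal ISSN, city, affiliated organization
Paper data | Paper ID, scholar list, institution list, normalized title, publication year, journal name, journal ISSN
Patent data | Patent title, inventor, inventor country, inventor province, patent publication number, patent agency
Project data | Project name, country, city
Standard data | Standard number, standard title, drafter, drafting organization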
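The feature-field table above can be read as a small configuration that says which fields of a raw record count as entity features. The following is only a minimal sketch of that idea: the English keys are illustrative translations, the `extract_features` helper is hypothetical, and only four of the entity types are shown.

```python
# Feature-field meta-definition, transcribed from the table above.
# The English keys are illustrative translations, not identifiers from the paper.
FEATURE_FIELDS = {
    "scholar": ["name", "institution", "country"],
    "institution": ["institution_name", "country", "city"],
    "journal": ["journal_name", "issn", "city", "affiliated_organization"],
    "project": ["project_name", "country", "city"],
}


def extract_features(entity_type, record):
    """Keep only the declared feature fields; report any that are missing."""
    fields = FEATURE_FIELDS[entity_type]
    features = {f: record[f] for f in fields if f in record}
    missing = [f for f in fields if f not in record]
    return features, missing


# Invented sample record: "email" is not a feature field, "country" is absent.
features, missing = extract_features(
    "scholar",
    {"name": "Scholar A", "institution": "Institute X", "email": "a@example.org"},
)
print(features)
print(missing)
```

Keeping the definition as data rather than code means a new entity type only needs a new entry in the mapping.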
Figure: Implicit Knowledge Graph Entities and Entity Relationships
Figure: Implicit Relationship Discovery Based on Entity Features
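As a rough illustration of discovering a relationship from entity features instead of storing it, the sketch below resolves "scholars of an institution" at query time by matching a shared feature value. It is not the authors' implementation: the records are invented, the field names are assumed, and a plain in-memory inverted index stands in for the search-engine cluster mentioned in the abstract.

```python
from collections import defaultdict

# Hypothetical scholar entities; the field names loosely follow the
# scholar feature fields (name, institution, country), the data are invented.
SCHOLARS = [
    {"id": "s1", "name": "Scholar A", "institution": "Institute X", "country": "CN"},
    {"id": "s2", "name": "Scholar B", "institution": "Institute X", "country": "CN"},
    {"id": "s3", "name": "Scholar C", "institution": "Institute Y", "country": "US"},
]


def build_index(entities, field):
    """Map each value of `field` to the ids of the entities that carry it."""
    index = defaultdict(set)
    for entity in entities:
        index[entity[field]].add(entity["id"])
    return index


def scholars_of(institution, index):
    """Resolve the scholar-institution relationship at query time."""
    return sorted(index.get(institution, set()))


index = build_index(SCHOLARS, "institution")
print(scholars_of("Institute X", index))  # ['s1', 's2']
```

No scholar-institution edge is stored anywhere in this sketch; the relationship exists only as the result of the lookup.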
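Table: Statistics of Scientific Research Knowledge Graph
Data type | Number of records | Storage size
Sci-tech papers | 143,753,912 | 8.79 TB
Patents | 86,565,778 | 3.34 TB
Researchers | 96,502,342 | 1.08 TB
Institutions | 11,928,538 | 47.0 GB
Journals | 76,107 | 0.99 GB
Funded projects | 5,221,316 | 276 GB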
Figure: Modification of Implicit Knowledge Graph Data Caused by Institutional Change
Figure: RDF Data Modification Caused by Institutional Change
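The two captions above contrast how an institutional change propagates in each kind of graph. A toy way to see the difference is to count the writes each representation needs when one scholar moves to another institution. Everything in this sketch is an assumption made for illustration (the predicate names, the inverse `hasMember` triple, and the single-field entity record); it does not reproduce the figures.

```python
def update_implicit(entity, new_institution):
    """Implicit graph: overwrite one feature value on the entity record."""
    entity["institution"] = new_institution
    return 1  # one record written; relationships are derived, not stored


def update_triples(triples, scholar, old_inst, new_inst):
    """Triple store: rewrite every stored triple linking scholar and old_inst."""
    writes = 0
    for s, p, o in list(triples):  # iterate over a copy while mutating the set
        if (s, o) == (scholar, old_inst):
            triples.remove((s, p, o))
            triples.add((s, p, new_inst))
            writes += 2  # one delete plus one insert
        elif (s, o) == (old_inst, scholar):
            triples.remove((s, p, o))
            triples.add((new_inst, p, o))
            writes += 2
    return writes


# Hypothetical data: scholar s1 moves from Institute X to Institute Y.
scholar = {"id": "s1", "institution": "Institute X"}
triples = {
    ("s1", "affiliatedWith", "Institute X"),
    ("Institute X", "hasMember", "s1"),
    ("s2", "affiliatedWith", "Institute X"),
}

implicit_writes = update_implicit(scholar, "Institute Y")
triple_writes = update_triples(triples, "s1", "Institute X", "Institute Y")
print(implicit_writes, triple_writes)
```

In a real triple store the number of affected triples grows with every relationship that mentions the changed entity, which is the consistency burden the abstract refers to.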
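Table: Retrieval Performance on Different Types of Knowledge Graphs for Journals in Which a Scholar Published
Scholar | Results | Implicit knowledge graph time (ms) | Triple-based knowledge graph time (ms)
Scholar 1 | 52 | 5 | 2,313
Scholar 2 | 35 | 6 | 1,192
Scholar 3 | 13 | 5 | 948
Scholar 4 | 14 | 5 | 1,034
Scholar 5 | 16 | 6 | 820
Scholar 6 | 34 | 6 | 995
Scholar 7 | 38 | 6 | 800
Scholar 8 | 17 | 7 | 719
Scholar 9 | 13 | 7 | 774
Scholar 10 | 13 | 7 | 780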
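Table: Retrieval Performance on Different Types of Knowledge Graphs for Academics of the Institution
Institution | Results | Implicit knowledge graph time (ms) | Triple-based knowledge graph time (ms)
Institution 1 | 2,716 | 28 | 2,293
Institution 2 | 72,209 | 36 | 3,515
Institution 3 | 64,573 | 37 | 2,612
Institution 4 | 59,247 | 33 | 2,256
Institution 5 | 26,391 | 29 | 1,876
Institution 6 | 36,676 | 28 | 1,964
Institution 7 | 13,359 | 31 | 1,864
Institution 8 | 2,503 | 23 | 1,414
Institution 9 | 56,169 | 33 | 1,832
Institution 10 | 8,563 | 37 | 1,044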
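The average retrieval times for the institution-to-scholars query can be recomputed directly from the ten sampled rows of the table above. The sketch below uses only those ten timings; the variable names are arbitrary, and the result describes this sample only, not the full experiment.

```python
from statistics import mean

# Timings in milliseconds, copied row by row from the institution table above.
implicit_ms = [28, 36, 37, 33, 29, 28, 31, 23, 33, 37]
triple_ms = [2293, 3515, 2612, 2256, 1876, 1964, 1864, 1414, 1832, 1044]

mean_implicit = mean(implicit_ms)
mean_triple = mean(triple_ms)
speedup = mean_triple / mean_implicit

print(f"implicit: {mean_implicit} ms, triple: {mean_triple} ms, ratio: {speedup:.1f}x")
```

On these ten rows the means are 31.5 ms and 2,067 ms, so the triple-based queries take roughly 66 times as long on average.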
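Table: Retrieval Performance on Different Types of Knowledge Graphs for All Publications Published by a Scholar
Scholar | Results | Implicit knowledge graph time (ms) | Triple-based knowledge graph time (ms)
Scholar 1 | 219 | 562 | 792
Scholar 2 | 131 | 672 | 370
Scholar 3 | 62 | 122 | 183
Scholar 4 | 70 | 632 | 220
Scholar 5 | 84 | 433 | 166
Scholar 6 | 178 | 748 | 459
Scholar 7 | 129 | 135 | 321
Scholar 8 | 173 | 562 | 305
Scholar 9 | 48 | 430 | 111
Scholar 10 | 132 | 404 | 393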
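Table: Implicit Knowledge Graph vs RDF Knowledge Graph
Indicator | Implicit knowledge graph | Triple-based knowledge graph
Data consistency | No data consistency problem | Difficult; maintaining consistency requires heavy computation and checking mechanisms, and complexity is high
Immediate visibility of data modifications | Easy; changes can be viewed as soon as they are made, with better average timeliness than the triple form | Difficult; a synchronization program for the relationship network must be built, fault repair is technically costly, and timeliness is poor
Relationship retrieval performance | Moderate; retrieval is relatively complex | Good
Relationship change cost | Low | High
Requirement on retrieval capability | |
Storage requirement | Low; only entity feature values are stored | High; the volume of relationship data is huge
Maintenance cost | Database and view interfaces must be maintained | Database must be maintained
Ease of use | Convenient; call the interface | Moderate; read entity and relationship data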