1. National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2. Department of Library, Information and Archives Management, University of Chinese Academy of Sciences, Beijing 100190, China
3. Institute of Agricultural Information, Chinese Academy of Agricultural Sciences, Beijing 100081, China
4. Library of ShanghaiTech University, Shanghai 201210, China
5. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029, China
6. Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
[Objective] This paper aims to construct name authority files for research entities such as authors, institutions, journals, and funding. [Methods] First, we loaded, cleansed, transformed, integrated, and merged names from multiple sources to create uniform structured data with unique identifiers. Then, we used a metadata model for name authority to extract research entities and the relationships among them. Finally, we applied disambiguation methods suited to each type of research entity, including Levenshtein distance, Jaccard similarity, word2vec, and CNN. [Results] Our study created name authority databases for authors (23 million records), institutions (2.6 million records), journals (30,000 records), and funding (2 million records). We selected the names of six institutions from NSTL and compared them with those from InCites; the average precision reached 86.8%. [Limitations] The proposed disambiguation strategies and algorithms need further refinement to handle the diverse expressions of the selected disambiguation features. Data from different sources also need to be analyzed so that appropriate algorithms can be applied to each. [Conclusions] The proposed method and disambiguation strategies could improve the performance and comprehensiveness of name authority databases.
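As an illustration of the string-based measures named in the Methods, the following is a minimal Python sketch of Levenshtein distance and Jaccard similarity applied to institution name variants. It does not reproduce the implementation used in this study: the function names, the word-level tokenization, and the thresholds `max_edit_ratio` and `min_jaccard` are all assumptions chosen for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the lowercase word sets of a and b."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def same_entity(a: str, b: str,
                max_edit_ratio: float = 0.2,   # hypothetical threshold
                min_jaccard: float = 0.6) -> bool:  # hypothetical threshold
    """Treat two name strings as one entity if either measure says they are close."""
    a_norm, b_norm = a.lower().strip(), b.lower().strip()
    longest = max(len(a_norm), len(b_norm)) or 1
    edit_ratio = levenshtein(a_norm, b_norm) / longest
    return edit_ratio <= max_edit_ratio or jaccard(a_norm, b_norm) >= min_jaccard
```

For example, `same_entity("Chinese Academy of Sciences", "Chinese Acad of Sciences")` accepts the abbreviated variant, while `same_entity("Peking University", "Tsinghua University")` rejects the pair. In practice the thresholds would need tuning per entity type and per data source.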