Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (1): 27-37    DOI: 10.11925/infotech.2096-3467.2018.1363
Constructing Name Authority for Research Entities
Jianyong Zhang1,2, Li Qian1,2, Qianqian Yu1, Zhipeng Dong1, Yongwen Huang3, Jianhua Liu4, Shu Guo5, Feng Wang6
1National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2Department of Library, Information and Archives Management, University of Chinese Academy of Sciences, Beijing 100190, China
3Institute of Agricultural Information, Chinese Academy of Agricultural Sciences, Beijing 100081, China
4Library of Shanghai Tech University, Shanghai 201210, China
5National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029, China
6Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
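Abstract (translated from the Chinese)

[Objective] To build authority files for institutions, authors, journals, and funds, providing a data foundation for discovery systems and for the analysis and evaluation of research entities. [Methods] Starting from multi-source heterogeneous data, we aggregate and fuse the data into unified structured records with unique identifiers. Research entities and the relationships among them are extracted according to a name authority metadata model. For each type of research entity, a separate set of disambiguation rules is defined based on the bibliographic features available for it, and text similarity is computed by combining traditional string-matching methods with deep learning methods. [Results] The work produced an institution authority database with more than 2.6 million records, an author authority database with more than 23 million records, a journal authority database with more than 30,000 records, and a fund authority database with more than 2 million records. Taking the NSTL institution authority file as an example and comparing it with the InCites institution authority file, the average benchmark agreement across six selected universities in the United States, the United Kingdom, and China reached 86.8%. [Limitations] The proposed disambiguation rules and algorithms need further refinement to handle the diversity in how bibliographic features are expressed; the data of each specific source must be analyzed in order to select a suitable algorithm or model. [Conclusions] This study proposes a method for aggregating and fusing multi-source heterogeneous data and designs disambiguation rules and algorithms for research entities, which effectively support the standardized and comprehensive construction of name authority databases.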
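
The aggregation step described in the abstract, which merges multi-source records into unified structured records with unique identifiers, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' pipeline: the record fields (`source`, `name`), the hypothetical `normalize` and `fuse` helpers, and the choice of exact matching on a normalized name as the merge key.

```python
import uuid
from collections import defaultdict


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(c if c.isalnum() else " " for c in name.lower())
    return " ".join(cleaned.split())


def fuse(records):
    """Group source records by normalized name; one identifier per group.

    Assumption: records whose normalized names are equal denote the same
    entity. A real authority file would need far richer evidence.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[normalize(rec["name"])].append(rec)

    fused = []
    for key, members in groups.items():
        fused.append({
            # Deterministic identifier derived from the merge key.
            "id": uuid.uuid5(uuid.NAMESPACE_URL, key).hex,
            "preferred_name": members[0]["name"],
            "variants": sorted({m["name"] for m in members}),
            "sources": sorted({m["source"] for m in members}),
        })
    return fused


records = [
    {"source": "A", "name": "Chinese Academy of Sciences"},
    {"source": "B", "name": "chinese  academy of sciences."},
    {"source": "B", "name": "Tsinghua University"},
]
authority = fuse(records)
```

Deriving the identifier from the merge key keeps it stable across reruns, which is one plausible way to obtain persistent identifiers; the paper's own identifier scheme is not stated in the abstract.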

Abstract

[Objective] This paper aims to construct name authority files for authors, institutions, journals, and funds. [Methods] First, we loaded, cleansed, transformed, integrated, and merged names from multiple sources to create uniform structured data with unique identifiers. Then, we used a metadata model for name authority to extract research entities and the relationships among them. Finally, we defined disambiguation rule sets for the different research entities and computed text similarity by combining string-matching methods (Levenshtein distance, Jaccard) with deep learning methods (word2vec, CNN). [Results] Our study created name authority databases for authors (more than 23 million records), institutions (more than 2.6 million records), journals (more than 30,000 records), and funds (more than 2 million records). We compared NSTL institution authority records with those from InCites for six universities selected from the United States, the United Kingdom, and China, and found that the average agreement reached 86.8%. [Limitations] The proposed disambiguation strategies and algorithms need to be further refined to deal with the diverse expressions of the selected disambiguation features. The data from each source needs to be analyzed in order to apply an appropriate algorithm. [Conclusions] The proposed method and disambiguation strategies could improve the standardization and comprehensiveness of name authority databases.
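
The two string-matching measures named in the abstract, Levenshtein distance and Jaccard similarity, can be sketched in a few lines. This is an illustrative sketch only: the function names, the word-level tokenization for Jaccard, the rule of taking the larger of the two scores, and the 0.8 threshold are all assumptions, not the rule sets the authors built. The word2vec and CNN components are not shown.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Scale edit distance into [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def same_entity(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical rule: match if either score reaches the threshold."""
    score = max(levenshtein_similarity(a.lower(), b.lower()), jaccard(a, b))
    return score >= threshold
```

Character-level edit distance tolerates typos and small spelling differences, while token-level Jaccard tolerates reordered or extra words; abbreviations such as "Univ" for "University" score poorly on both, which is one reason a pipeline might add learned representations.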

Keywords: Name Authority; Journal Authority; Institution Authority; Fund Authority; Author Authority
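Received: 2018-12-03
Funding: This work is an outcome of the National Science and Technology Library (NSTL) funded project "Construction of Name Authority Databases" (Grant No. 科1817), the National Science Library, Chinese Academy of Sciences Young Talent Frontier Project "Research on Name Authority Methods Based on Deep Learning" (Grant No. G180171001), and the National Science Library, Chinese Academy of Sciences Key Task Special Project "Analysis of Researchers' Research Directions and Research Priorities" (Grant No. 院1643).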
Cite this article:
Jianyong Zhang, Li Qian, Qianqian Yu, Zhipeng Dong, Yongwen Huang, Jianhua Liu, Shu Guo, Feng Wang. Constructing Name Authority for Research Entities[J]. Data Analysis and Knowledge Discovery, 2019, 3(1): 27-37. DOI: 10.11925/infotech.2096-3467.2018.1363.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.1363