Matching Strategies for Institution Names in Literature Database*

doi:10.11925/infotech.2096-3467.2018.0178

Data Analysis and Knowledge Discovery

2018, Vol. 2, Issue 8: 88-97 https://doi.org/10.11925/infotech.2096-3467.2018.0178

Application Paper

Matching Strategies for Institution Names in Literature Database

Sun Haixia¹,², Wang Lei², Wu Yingjie², Hua Weina¹, Li Junlian²

¹School of Information Management, Nanjing University, Nanjing 210093, China
²Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100020, China

Abstract:

[Objective] This paper designs and implements matching strategies for institution names in scientific literature databases, aiming to standardize how these names are stored and managed. [Methods] We introduced region, type and naming characteristics and built seven groups of matching rules in three categories. We then designed four hybrid matching strategies that combine these rules with Levenshtein (edit) distance, and implemented and evaluated them on the "author affiliation" data of the Chinese Biomedical Literature (CBM) database for 2006-2011. [Results] On a dataset of more than six million affiliation strings, we matched the names of three types of institutions: higher education institutions, hospitals and research institutes. The hybrid strategy that combines the region and naming-characteristic rules with Levenshtein distance performed best, with precision above 80% in all cases, recall reaching 64.82% and an F-value reaching 71.66%. [Limitations] The auxiliary dictionaries and rules were built mainly from human experience, so their coverage is incomplete. Errors in institution name recognition affect the matching results. The proposed strategy cannot effectively normalize institution names whose surface forms differ substantially. [Conclusions] The proposed matching strategy, based on rules and edit distance, can improve the standardization of scientific literature databases.
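The combination of rules and edit distance described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Institution` fields, the agreement rule on region and category, and the 0.8 threshold are all hypothetical choices made for illustration, and the paper's seven rule groups and auxiliary dictionaries are not reproduced here.

```python
from dataclasses import dataclass


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalised to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


@dataclass(frozen=True)
class Institution:
    # Hypothetical record layout; region and category are assumed to be
    # extracted from the affiliation string by an upstream step.
    name: str
    region: str
    category: str  # e.g. "university", "hospital", "research institute"


def is_match(x: Institution, y: Institution, threshold: float = 0.8) -> bool:
    """Hypothetical hybrid rule: region and category must agree before the
    edit-distance similarity is compared with an assumed threshold."""
    if x.region != y.region or x.category != y.category:
        return False
    return similarity(x.name, y.name) >= threshold
```

For example, `Institution("北京大学第一医院", "北京", "hospital")` and `Institution("北京大学第1医院", "北京", "hospital")` differ by one substituted character out of eight, giving a similarity of 0.875, so `is_match` accepts the pair. The same two names with different regions are rejected by the rule before any distance is computed, which is the sense in which rules filter candidates ahead of the string comparison.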

Key words: Information Retrieval; Normalization of Affiliation Strings; Similarity Measure; Hybrid Strategy; Literature Database

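Received: 2018-02-11 Published: 2018-09-08

CLC number (ZTFLH): TP393

Funding: *This work is one of the research outcomes of the Fundamental Research Funds for Central Public Welfare Research Institutes project "Research on a Normalization Mechanism for Author Institution Names Based on Co-occurrence Analysis" (Grant No. 2016RC330006) and of the National Science and Technology Library preliminary R&D task "Research on Key Technologies for Automatic Construction and Maintenance of STKOS" (Grant No. XQYF0102), part of the "Next-Generation National Open Knowledge Service System for Scientific and Technological Innovation".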

Cite this article:

Sun Haixia, Wang Lei, Wu Yingjie, Hua Weina, Li Junlian. Matching Strategies for Institution Names in Literature Database[J]. Data Analysis and Knowledge Discovery, 2018, 2(8): 88-97.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.0178 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2018/V2/I8/88

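Figures and tables:

Overall research approach

Classification of rules

Examples of key feature words for selected categories

Flow of the institution name matching algorithm

Common compositional structures of author affiliation strings

Statistics of the test datasets

Experimental evaluation results of the different combined schemes on the three test sets

Performance trends of the different schemes across the three test sets

Precision of the different combined schemes on the subsets for each category

Recall of combined scheme C₃ and its variation across the test subsets for the three types of institution names

F-value of combined scheme C₃ and its variation across the test subsets for the three types of institution names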
