Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (8): 88-97    DOI: 10.11925/infotech.2096-3467.2018.0178
Matching Strategies for Institution Names in Literature Database
Sun Haixia1,2, Wang Lei2, Wu Yingjie2, Hua Weina1, Li Junlian2
1School of Information Management, Nanjing University, Nanjing 210093, China
2Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100020, China
Abstract  

[Objective] This paper designs and implements matching strategies for institution names in literature databases, aiming to standardize their storage and management. [Methods] We first established seven name-matching rules based on the region, type, and naming characteristics of institutions. We then designed four hybrid matching strategies that combine these rules with Levenshtein distance. Finally, we evaluated the four hybrid strategies on institution names from papers indexed by the Chinese Biomedical Literature (CBM) database during 2006-2011. [Results] More than six million affiliation strings from CBM were matched, covering higher education institutions, hospitals, and research institutes. The hybrid matching strategy based on region, naming characteristics, and Levenshtein distance obtained the highest precision (above 80% on all test groups), with recall of up to 64.82% and an F-value of up to 71.66%. [Limitations] The rules and the related dictionary were constructed mainly from human experience, so their coverage is limited. Some errors occur when identifying institution names. The proposed strategy cannot address issues caused by changes to the institutions themselves. [Conclusions] The proposed strategies could improve the performance of scientific research literature databases.
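
The abstract describes combining rules with Levenshtein distance but does not give the implementation. The following is a minimal sketch of that general idea, assuming a region rule applied as a hard filter followed by a normalized edit-distance comparison. The function names, the `authority` list of (standard name, region) pairs, and the 0.8 threshold are hypothetical and are not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def match(name, region, authority, threshold=0.8):
    """Return the most similar standard name in the same region, or None.

    `authority` is a hypothetical list of (standard_name, region) pairs;
    the threshold value is an assumption chosen for illustration.
    """
    best_name, best_score = None, threshold
    for std_name, std_region in authority:
        if std_region != region:  # rule step: regions must agree
            continue
        score = similarity(name, std_name)
        if score >= best_score:
            best_name, best_score = std_name, score
    return best_name
```

Filtering by region before computing the edit distance keeps similarly named institutions in different places from being merged, which is one plausible reason a rule-plus-distance strategy trades some recall for precision.
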

Key words: Information Retrieval; Normalization of Affiliation Strings; Similarity Measure; Hybrid Strategy; Literature Database
Received: 11 February 2018      Published: 08 September 2018
ZTFLH:  TP393  

Cite this article:

Sun Haixia,Wang Lei,Wu Yingjie,Hua Weina,Li Junlian. Matching Strategies for Institution Names in Literature Database. Data Analysis and Knowledge Discovery, 2018, 2(8): 88-97.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.0178     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2018/V2/I8/88

LR R1
CR R2
LFR R3-R6 R7
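| Institution category | Example key feature words |
| --- | --- |
| Hospital | 医院 (hospital), 临床中心 (clinical center), 门诊中心 (outpatient center), 门诊部 (outpatient department), … |
| Medical higher education institution | 学院 (college), 大学 (university), 学校 (school), 学部 (faculty), … |
| Medical research institution | 科学院 (academy of sciences), 研究所 (research institute), 研究院 (research academy), 研究中心 (research center), 创新中心 (innovation center), … |
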
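
The key feature words listed above suggest a simple dictionary lookup for assigning an institution category. The sketch below is illustrative only: the tie-breaking heuristic (prefer the feature word that ends latest in the name, then the longer word) is an assumption motivated by Chinese names placing the head noun last, and it is not described in the source. The dictionary contains only the example words from the table.

```python
# Example feature words from the table; the real dictionary is larger.
FEATURE_WORDS = {
    "hospital": ["医院", "临床中心", "门诊中心", "门诊部"],
    "higher_education": ["学院", "大学", "学校", "学部"],
    "research_institute": ["科学院", "研究所", "研究院", "研究中心", "创新中心"],
}


def classify(name: str):
    """Return the category whose feature word ends latest in `name`.

    Ties on end position are broken in favor of the longer word, so that
    科学院 wins over its substring 学院. Returns None if nothing matches.
    """
    best_type, best_key = None, (-1, 0)
    for inst_type, words in FEATURE_WORDS.items():
        for word in words:
            pos = name.rfind(word)
            if pos == -1:
                continue
            key = (pos + len(word), len(word))
            if key > best_key:
                best_type, best_key = inst_type, key
    return best_type
```

With this heuristic, a name such as 江苏省南通大学附属肿瘤医院 is classified as a hospital even though it also contains 大学, because 医院 appears later in the string.
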
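| No. | Common structure of author affiliation strings | Example |
| --- | --- | --- |
| 1 | institution + comma + province and city names + postal code | 昆山市第一人民医院肿瘤科, 江苏昆山 215300 (Department of Oncology, Kunshan First People's Hospital, Kunshan, Jiangsu 215300) |
| 2 | institution + comma + city name + postal code | 上海复旦大学附属华山医院神外科, 上海 200040 (Department of Neurosurgery, Huashan Hospital Affiliated to Fudan University, Shanghai 200040) |
| 3 | institution + comma + province name + postal code | 昆山市第一人民医院, 江苏省 215300 (Kunshan First People's Hospital, Jiangsu Province 215300) |
| 4 | institution + comma + postal code | 江苏省南通大学附属肿瘤医院, 226361 (Tumor Hospital Affiliated to Nantong University, Jiangsu Province, 226361) |
| 5 | institution + postal code | 江苏省南通大学附属肿瘤医院 226361 (Tumor Hospital Affiliated to Nantong University, Jiangsu Province 226361) |
| 6 | institution | 安徽医科大学第一附属医院消化内科 (Department of Gastroenterology, First Affiliated Hospital of Anhui Medical University) |
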
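
The six structures above can be covered by a single regular expression that treats the region and the postal code as optional parts. This is a sketch under stated assumptions, not the paper's parser: the pattern name `AFFILIATION` and the function `parse_affiliation` are hypothetical, the postal code is assumed to be exactly six digits, and institution names that themselves contain digits or commas are not handled.

```python
import re

AFFILIATION = re.compile(
    r"^\s*(?P<institution>[^,，\d]+?)\s*"    # institution name (no digits/commas)
    r"(?:[,，]\s*(?P<region>[^\d,，\s]+))?"  # optional ", province/city"
    r"[,，]?\s*(?P<postcode>\d{6})?\s*$"     # optional six-digit postal code
)


def parse_affiliation(text: str):
    """Split an affiliation string into institution, region, and postcode.

    Returns a dict with those three keys (missing parts are None),
    or None if the string does not fit any of the six structures.
    """
    m = AFFILIATION.match(text)
    return None if m is None else m.groupdict()
```

Both the ASCII comma and the full-width comma are accepted, since scraped affiliation strings often mix the two.
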
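| Test group | Institution category | Base dataset | CBM indexing years | Deduplicated institution name strings | New dataset | CBM indexing years | Deduplicated institution name strings |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Group 1 (T1) | Universities and colleges | TBD1 | 2006-2008 | 22,685 | TID1 | 2009-2011 | 10,192 |
| Group 1 (T1) | Research institutes | TBD1 | 2006-2008 | 11,178 | TID1 | 2009-2011 | 5,182 |
| Group 1 (T1) | Hospitals | TBD1 | 2006-2008 | 93,895 | TID1 | 2009-2011 | 59,937 |
| Group 1 (T1) | Total | TBD1 | 2006-2008 | 127,758 | TID1 | 2009-2011 | 75,311 |
| Group 2 (T2) | Universities and colleges | TBD2 | 2006-2009 | 26,943 | TID2 | 2010-2011 | 5,932 |
| Group 2 (T2) | Research institutes | TBD2 | 2006-2009 | 13,195 | TID2 | 2010-2011 | 3,165 |
| Group 2 (T2) | Hospitals | TBD2 | 2006-2009 | 113,554 | TID2 | 2010-2011 | 40,281 |
| Group 2 (T2) | Total | TBD2 | 2006-2009 | 153,692 | TID2 | 2010-2011 | 49,378 |
| Group 3 (T3) | Universities and colleges | TBD3 | 2006-2010 | 31,014 | TID3 | 2011 | 1,862 |
| Group 3 (T3) | Research institutes | TBD3 | 2006-2010 | 15,051 | TID3 | 2011 | 1,313 |
| Group 3 (T3) | Hospitals | TBD3 | 2006-2010 | 133,003 | TID3 | 2011 | 20,833 |
| Group 3 (T3) | Total | TBD3 | 2006-2010 | 179,068 | TID3 | 2011 | 24,008 |
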
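| Strategy | T1 P | T1 R | T1 F-value | T2 P | T2 R | T2 F-value | T3 P | T3 R | T3 F-value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C1 | 71.15% | 62.26% | 66.41% | 72.68% | 68.66% | 70.62% | 72.79% | 74.37% | 73.57% |
| C2 | 71.23% | 60.80% | 65.60% | 72.23% | 66.82% | 69.42% | 72.45% | 72.92% | 72.69% |
| C3 | 80.72% | 53.29% | 64.20% | 80.56% | 59.22% | 68.26% | 80.11% | 64.82% | 71.66% |
| C4 | 80.77% | 51.10% | 62.59% | 80.46% | 57.20% | 66.86% | 80.00% | 63.17% | 70.59% |
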
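
The F-values in the table above are consistent with the usual harmonic mean of precision (P) and recall (R), F = 2PR / (P + R). The short check below recomputes two entries under that assumption; the function name `f_value` is chosen here for illustration.

```python
def f_value(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Strategy C3 on test group T3: P = 80.11%, R = 64.82%
c3_t3 = f_value(0.8011, 0.6482) * 100
# Strategy C1 on test group T1: P = 71.15%, R = 62.26%
c1_t1 = f_value(0.7115, 0.6226) * 100
```

Rounded to two decimals, these reproduce the reported 71.66% and 66.41%.
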
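| Strategy | Universities and colleges T1 | Universities and colleges T2 | Universities and colleges T3 | Research institutes T1 | Research institutes T2 | Research institutes T3 | Hospitals T1 | Hospitals T2 | Hospitals T3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C1 | 72.55% | 72.37% | 68.84% | 79.51% | 79.64% | 77.94% | 71.05% | 72.33% | 72.85% |
| C2 | 72.35% | 71.86% | 67.70% | 77.33% | 78.79% | 77.08% | 71.05% | 72.25% | 72.77% |
| C3 | 74.43% | 74.24% | 70.41% | 84.91% | 85.41% | 84.83% | 81.75% | 81.45% | 80.79% |
| C4 | 74.44% | 74.00% | 69.51% | 77.46% | 79.57% | 77.27% | 81.61% | 81.25% | 80.94% |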