Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (8): 88-97    DOI: 10.11925/infotech.2096-3467.2018.0178
Matching Strategies for Institution Names in Literature Database
Haixia Sun1,2, Lei Wang2, Yingjie Wu2, Weina Hua1, Junlian Li2
1School of Information Management, Nanjing University, Nanjing 210093, China
2Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100020, China
[Objective] This paper designs and implements matching strategies for institution names in literature databases, aiming to standardize their storage and management. [Methods] We first established seven name-matching rules based on the institutions' regions, types and naming characteristics. We then designed four hybrid matching strategies that combine these rules with Levenshtein distance. Finally, we evaluated the four hybrid strategies on institution names from papers indexed by the Chinese Biomedical Literature (CBM) database during 2006-2011. [Results] More than six million affiliation strings from CBM were matched, covering higher education institutions, hospitals and research institutes. The hybrid matching strategy based on region, naming characteristics and Levenshtein distance obtained the highest precision (all above 80%), recall (64.82%) and F-value (71.66%). [Limitations] The rules and the related dictionary were constructed mainly from human experience, so their coverage is limited. Some errors remain in the identification of institution names. The proposed strategy cannot address issues caused by the transformative actions of institutions. [Conclusions] The proposed strategies could improve the performance of scientific research literature databases.
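The idea of combining rules with Levenshtein distance can be illustrated with a minimal sketch. It is not the authors' implementation: the region list, the single region rule, the normalized similarity score, the 0.8 threshold and all function names are assumptions chosen only for illustration. A region rule first discards candidates located elsewhere, and edit distance then ranks the remaining candidates.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Hypothetical region dictionary; a real one would be far larger.
REGIONS = ("Beijing", "Nanjing", "Shanghai")


def region_of(name: str):
    """Return the first known region mentioned in the name, or None."""
    lowered = name.lower()
    for region in REGIONS:
        if region.lower() in lowered:
            return region
    return None


def match_institution(raw: str, standard_names, threshold: float = 0.8):
    """Return the best standard name for a raw string, or None if no match."""
    raw_region = region_of(raw)
    best_name, best_score = None, 0.0
    for candidate in standard_names:
        cand_region = region_of(candidate)
        # Assumed rule: names that mention different regions never match.
        if raw_region and cand_region and raw_region != cand_region:
            continue
        score = similarity(raw.lower(), candidate.lower())
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name if best_score >= threshold else None
```

With the standard names "Nanjing University" and "Beijing University", the misspelled string "Nanjing Universty" resolves to "Nanjing University", while "Beijing Univercity" finds no match when only "Nanjing University" is available, because the region rule rejects the candidate before any distance is computed.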

Key words: Information Retrieval, Normalization of Affiliation Strings, Similarity Measure, Hybrid Strategy, Literature Database
Received: 11 February 2018      Published: 08 September 2018

Cite this article:

Haixia Sun, Lei Wang, Yingjie Wu, Weina Hua, Junlian Li. Matching Strategies for Institution Names in Literature Database. Data Analysis and Knowledge Discovery, 2018, 2(8): 88-97.

