Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (4): 90-102    DOI: 10.11925/infotech.2096-3467.2020.0532
Disambiguation of Chinese Author Names with Multiple Features
Lin Kerou, Wang Hao, Gong Lijuan, Zhang Baolong
School of Information Management, Nanjing University, Nanjing 210023, China
Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
Abstract  

[Objective] This paper addresses the problems that Chinese authors sharing the same name cause for document management systems. [Methods] First, we built author entities in the form “author name + institution name” from bibliographic data. Second, we used the attributes of these entities to construct six similarity features covering three aspects. Third, we fused the features through principal component analysis (PCA) or direct weight assignment. Finally, we evaluated the performance of the proposed method. [Results] Our methods significantly reduced processing time. Their F1 values were 70.74% and 70.42% on the LIS dataset, and 81.90% and 80.93% on the economics dataset. [Limitations] The attributes used in this research were retrieved only from the metadata of the papers. [Conclusions] The proposed method could improve the weight setting of multiple features.

Keywords: Feature Fusion; Author Name Disambiguation; PCA; Chinese Papers
Received: 08 June 2020      Published: 10 October 2020
CLC number: TP393
Fund: “Six Talent Peaks” Project in Jiangsu Province (JY-001); Jiangsu Young Talents in Social Sciences, the Tang Scholar of Nanjing University
Corresponding author: Wang Hao     E-mail: ywhaowang@nju.edu.cn

Cite this article:

Lin Kerou, Wang Hao, Gong Lijuan, Zhang Baolong. Disambiguation of Chinese Author Names with Multiple Features. Data Analysis and Knowledge Discovery, 2021, 5(4): 90-102.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.0532     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I4/90

Research Framework
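Field | Value
AU (author) | 陈婧
IN (institution) | 湖南师范大学
CA (co-authors) | 王知津/张收棉/张素芳/严贝妮/周贺来
CI (co-author institutions) | 南开大学/国家图书馆
IA (authors at the institution) | 蔡骐/陈婧/陈君莲/陈万忠/党洪莉/党美锦…
TI (titles) | 企业竞争情报作战室运行准备机制研究/政府公共决策支持系统的构建 (Research on the Operational Preparation Mechanism of Enterprise Competitive Intelligence War Rooms / Construction of a Government Public Decision Support System)
KW (keywords) | 企业竞争情报/竞争情报作战室/运行准备/公共决策/决策支持/信息系统 (enterprise competitive intelligence / competitive intelligence war room / operational preparation / public decision-making / decision support / information system)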
An Example of the Description Structure of the Entity to be Disambiguated
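Aspect | Features
Institution information (SIIN) | Institution-author similarity (SIA); institution name similarity (SIN)
Collaboration information (SCIN) | Co-author similarity (SCA); co-author institution similarity (SCI)
Topic information (SSIN) | Title similarity (STI); keyword similarity (SKW)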
Summary of All Features
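The page does not say how each similarity is computed. As a minimal sketch only, assuming a cosine similarity over two sets of names (an assumption, not the authors' confirmed formula), a co-author similarity such as SCA could be computed as below. The function name `set_cosine` and the two co-author lists are hypothetical.

```python
import math


def set_cosine(a, b):
    """Cosine similarity between two collections treated as binary sets.

    Assumed formula: |A ∩ B| / sqrt(|A| * |B|); returns 0.0 if either is empty.
    """
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


# Hypothetical co-author lists for two entities that share an author name.
coauthors_1 = ["王知津", "张收棉", "张素芳"]
coauthors_2 = ["王知津", "严贝妮"]

sca = set_cosine(coauthors_1, coauthors_2)
print(round(sca, 3))  # one shared name out of 3 and 2: 1 / sqrt(6)
```

The same helper would apply to any set-valued attribute (CA, CI, IA, KW); text-valued attributes such as titles would first need tokenizing, which is outside this sketch.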
Test set | No. | AU | IN1 | IN2 | SIA | SIN | SCA | SCI | STI | SKW
LIS | 1 | 安新颖 | 中国医学科学院 | 中国医学科学院医学信息研究所 | 0.845 | 1.000 | 0.612 | 0.000 | 0.204 | 0.000
LIS | 2 | 安新颖 | 中国科学院文献情报中心 | 中国医学科学院 | 0.442 | 0.000 | 0.250 | 0.707 | 0.226 | 0.120
LIS | 3 | 安新颖 | 黑龙江大学 | 中国科学院文献情报中心 | 0.331 | 0.000 | 0.408 | 0.000 | 0.000 | 0.000
LIS | 1281 | 张梅 | 南京大学 | 西安文理学院图书馆 | 0.175 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
LIS | 1282 | 张梅 | 解放军医学图书馆 | 南京大学 | 0.298 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
LIS | 1283 | 张梅 | 福建师范大学 | 河北大学 | 0.214 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
ECO | 1 | 张巍 | 吉林财经大学 | 上海理工大学 | 0.203 | 0.000 | 0.000 | 0.000 | 0.086 | 0.000
ECO | 2 | 张巍 | 吉林财经大学 | 财政部企业司 | 0.308 | 0.000 | 0.000 | 0.000 | 0.052 | 0.000
ECO | 3 | 张巍 | 吉林财经大学 | 华信惠悦中国投资咨询部 | 0.335 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
ECO | 1736 | 姜松 | 重庆理工大学经济金融学院 | 重庆理工大学 | 0.431 | 1.000 | 0.500 | 0.000 | 0.234 | 0.000
ECO | 1737 | 姜松 | 重庆理工大学经济金融学院 | 西南大学经济管理学院 | 0.235 | 0.236 | 0.267 | 0.577 | 0.243 | 0.102
ECO | 1738 | 姜松 | 重庆理工大学 | 西南大学经济管理学院 | 0.191 | 0.000 | 0.267 | 0.000 | 0.178 | 0.102
The Similarity Calculation Results of the Entities to be Disambiguated in Test Dataset
Test set | No. | AU | IN1 | IN2 | judge | FS
LIS | 1 | 张静 | 互联网实验室 | 浙江传媒学院互联网与社会研究中心 | 1 | 4.490
LIS | 2 | 王斌 | 河南工业大学 | 河南工业大学管理学院 | 1 | 4.456
LIS | 3 | 张静 | 东北林业大学 | 东北林业大学图书馆 | 1 | 4.099
LIS | 4 | 白献阳 | 河北大学管理学院 | 中国人民大学 | 1 | 3.926
LIS | 5 | 刘冬梅 | 天津理工大学 | 天津理工大学图书馆 | 1 | 3.604
LIS | 6 | 刘华 | 上海大学 | 上海大学图书馆 | 1 | 3.419
LIS | 7 | 安新颖 | 中国医学科学院 | 中国医学科学院医学信息研究所 | 1 | 3.383
LIS | 8 | 鄢小燕 | 中国科学院 | 中国科学院国家科学图书馆成都分馆 | 1 | 3.353
LIS | 9 | 张静 | 西安交通大学图书馆采编部 | 西安交通大学图书馆副研究馆员 | 1 | 3.253
LIS | 10 | 李广建 | 北京大学 | 北京大学信息管理系 | 1 | 3.112
ECO | 1 | 李建军 | 新疆大学新疆创新管理研究中心 | 新疆大学经济与管理学院 | 1 | 8.708
ECO | 2 | 丁慧 | 南京大学商学院 | 南京大学商学院经济学系 | 1 | 5.328
ECO | 3 | 严良 | 中国地质大学[武汉]经济管理学院 | 中国地质大学(武汉)经济管理学院 | 1 | 4.562
ECO | 4 | 刘杨 | 天水师范学院经管学院 | 天水师范学院经济与社会管理学院 | 1 | 4.325
ECO | 5 | 李军 | 中铝财务有限责任公司 | 对外经济贸易大学国际商学院 | 1 | 4.286
ECO | 6 | 贾康 | 财政部财政科研所 | 财政部财科所 | 1 | 4.243
ECO | 7 | 姜松 | 重庆理工大学经济与贸易学院 | 重庆理工大学经济金融学院 | 1 | 4.142
ECO | 8 | 刘伟 | 北京大学 | 北京大学经济学院 | 1 | 3.793
ECO | 9 | 侯鹏 | 北京林业大学经济管理学院 | 北京林业大学 | 1 | 3.767
ECO | 10 | 张云 | 南开大学经济学院经济系 | 南开大学经济学院 | 1 | 3.693
The Top 10 Results of Sorted FS
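A ranking like the one above can be turned into precision, recall and F1 by cutting the sorted list at some point. In the evaluation tables on this page the reported recall stays close to the R_threshold column, which suggests (an inference, not something the page states) that R_threshold is a recall target. The sketch below follows that reading; the function name `evaluate_at_recall` and the toy (FS, judge) pairs are hypothetical.

```python
def evaluate_at_recall(pairs, recall_target):
    """Rank (fused_score, judge) pairs by score and cut at a recall target.

    judge is 1 when the two entities are the same person, else 0.
    Returns (precision, recall, f1) at the first cut whose recall
    reaches recall_target.
    """
    ranked = sorted(pairs, key=lambda p: p[0], reverse=True)
    total_pos = sum(judge for _, judge in ranked)
    if total_pos == 0:
        return 0.0, 0.0, 0.0
    tp = 0
    for k, (_, judge) in enumerate(ranked, start=1):
        tp += judge
        recall = tp / total_pos
        if recall >= recall_target:
            precision = tp / k
            if precision + recall == 0:
                return 0.0, 0.0, 0.0
            f1 = 2 * precision * recall / (precision + recall)
            return precision, recall, f1
    return 0.0, 0.0, 0.0  # only reached if recall_target > 1


# Toy data: six candidate pairs, four of which are true matches.
toy_pairs = [(4.49, 1), (4.10, 1), (3.20, 0), (2.90, 1), (1.50, 0), (0.80, 1)]
precision, recall, f1 = evaluate_at_recall(toy_pairs, recall_target=0.75)
print(precision, recall, f1)
```

With this toy data the cut falls after the fourth pair, where three of the four true matches have been retrieved.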
The Evaluation Results of Fusing All Features by PCA on LIS Test Dataset
The Evaluation Results of Fusing All Features by PCA on ECO Test Dataset
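The page does not describe how PCA is turned into feature weights. One common variant, shown here purely as an assumption, takes the absolute loadings of the first principal component and normalizes them to sum to 1. The sketch uses plain Python with power iteration instead of a linear algebra library. The function name `first_component_weights` is hypothetical, and the input is just the six LIS rows of SIA..SKW values from the similarity table above, far too few to reproduce the published weights.

```python
import math


def first_component_weights(rows, iters=500):
    """Weights from |loadings| of the first principal component (assumed scheme)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the features.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration toward the leading eigenvector of the covariance matrix.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    total = sum(abs(x) for x in v)
    return [abs(x) / total for x in v]


# Columns: SIA, SIN, SCA, SCI, STI, SKW (six LIS rows from the table above).
lis_rows = [
    [0.845, 1.000, 0.612, 0.000, 0.204, 0.000],
    [0.442, 0.000, 0.250, 0.707, 0.226, 0.120],
    [0.331, 0.000, 0.408, 0.000, 0.000, 0.000],
    [0.175, 0.000, 0.000, 0.000, 0.000, 0.000],
    [0.298, 0.000, 0.000, 0.000, 0.000, 0.000],
    [0.214, 0.000, 0.000, 0.000, 0.000, 0.000],
]
weights = first_component_weights(lis_rows)
print([round(w, 3) for w in weights])
```

Other schemes, such as weighting several components by their explained variance, are equally plausible readings of "fusing features by PCA".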
Test set | No. | Removed aspect | WZSIA | WZSIN | WZSCA | WZSCI | WZSTI | WZSKW | R_threshold | P/% | R/% | F1/%
LIS | 1 | (none) | 0.23 | 0.27 | 0.12 | 0.02 | 0.16 | 0.20 | 0.50 | 81.93 | 50.00 | 62.10
LIS | 2 | SIIN | 0.00 | 0.00 | 0.18 | 0.10 | 0.36 | 0.36 | 0.65 | 52.07 | 64.71 | 57.70
LIS | 3 | SCIN | 0.40 | 0.38 | 0.00 | 0.00 | 0.08 | 0.14 | 0.40 | 81.82 | 39.71 | 53.47
LIS | 4 | SSIN | 0.38 | 0.39 | 0.18 | 0.06 | 0.00 | 0.00 | 0.45 | 89.71 | 44.85 | 59.80
ECO | 1 | (none) | 0.08 | 0.15 | 0.23 | 0.15 | 0.18 | 0.20 | 0.65 | 87.10 | 65.32 | 74.65
ECO | 2 | SIIN | 0.00 | 0.00 | 0.29 | 0.21 | 0.24 | 0.26 | 0.50 | 82.67 | 50.00 | 62.31
ECO | 3 | SCIN | 0.12 | 0.25 | 0.00 | 0.00 | 0.31 | 0.32 | 0.65 | 69.23 | 65.32 | 67.22
ECO | 4 | SSIN | 0.16 | 0.24 | 0.34 | 0.27 | 0.00 | 0.00 | 0.70 | 92.55 | 70.16 | 79.82
The Highest F1 Value of Fusing Features by PCA after Removing the Features of a Single Aspect
Test set | No. | Removed feature | WZSIA | WZSIN | WZSCA | WZSCI | WZSTI | WZSKW | R_threshold | P/% | R/% | F1/%
LIS | 1 | (none) | 0.23 | 0.27 | 0.12 | 0.02 | 0.16 | 0.20 | 0.50 | 81.93 | 50.00 | 62.10
LIS | 2 | SIA | 0.00 | 0.27 | 0.14 | 0.04 | 0.26 | 0.29 | 0.65 | 61.81 | 65.44 | 63.57
LIS | 3 | SIN | 0.14 | 0.00 | 0.15 | 0.08 | 0.31 | 0.32 | 0.65 | 53.01 | 64.71 | 58.28
LIS | 4 | SCA | 0.04 | 0.10 | 0.00 | 0.26 | 0.32 | 0.29 | 0.65 | 52.69 | 64.71 | 58.09
LIS | 5 | SCI | 0.32 | 0.30 | 0.17 | 0.00 | 0.08 | 0.13 | 0.45 | 91.04 | 44.85 | 60.10
LIS | 6 | STI | 0.29 | 0.33 | 0.14 | 0.02 | 0.00 | 0.21 | 0.50 | 83.95 | 50.00 | 62.67
LIS | 7 | SKW | 0.31 | 0.33 | 0.16 | 0.05 | 0.16 | 0.00 | 0.50 | 82.93 | 50.00 | 62.39
ECO | 1 | (none) | 0.08 | 0.15 | 0.23 | 0.15 | 0.18 | 0.20 | 0.65 | 87.10 | 65.32 | 74.65
ECO | 2 | SIA | 0.00 | 0.17 | 0.25 | 0.17 | 0.20 | 0.22 | 0.65 | 84.38 | 65.32 | 73.64
ECO | 3 | SIN | 0.10 | 0.00 | 0.27 | 0.19 | 0.21 | 0.23 | 0.50 | 84.93 | 50.00 | 62.94
ECO | 4 | SCA | 0.10 | 0.21 | 0.00 | 0.17 | 0.26 | 0.26 | 0.70 | 70.73 | 70.16 | 70.45
ECO | 5 | SCI | 0.10 | 0.19 | 0.25 | 0.00 | 0.21 | 0.24 | 0.70 | 79.09 | 70.16 | 74.36
ECO | 6 | STI | 0.11 | 0.19 | 0.28 | 0.19 | 0.00 | 0.22 | 0.70 | 90.63 | 70.16 | 79.09
ECO | 7 | SKW | 0.11 | 0.20 | 0.28 | 0.21 | 0.20 | 0.00 | 0.70 | 87.88 | 70.16 | 78.03
The Highest F1 Value of Fusing Features by PCA after Removing One Single Feature
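The F1 column in these tables is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick consistency check, the snippet below recomputes F1 for three rows copied from the table above. The helper name `f1_score` is hypothetical, and small differences in the last digit are expected because the published P and R are already rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P/%, R/%, reported F1/%) taken from the table above.
reported = [
    (81.93, 50.00, 62.10),  # LIS, no feature removed
    (87.10, 65.32, 74.65),  # ECO, no feature removed
    (90.63, 70.16, 79.09),  # ECO, STI removed
]
recomputed = [f1_score(p, r) for p, r, _ in reported]
for (p, r, f1_reported), f1_new in zip(reported, recomputed):
    print(p, r, f1_reported, round(f1_new, 2))
```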
The Effects of Direct Weight Assignment with One Aspect's Features as the Unit Compared with a Single Feature as the Unit
Test set | No. | WSIIN | WSCIN | WSSIN | R_threshold | P/% | R/% | F1/%
LIS | 1 | 0.05 | 0.85 | 0.10 | 0.55 | 97.40 | 55.15 | 70.42
LIS | 2 | 0.10 | 0.70 | 0.20 | 0.55 | 97.40 | 55.15 | 70.42
ECO | 1 | 0.05 | 0.95 | 0.00 | 0.70 | 95.60 | 70.16 | 80.93
ECO | 2 | 0.10 | 0.90 | 0.00 | 0.70 | 95.60 | 70.16 | 80.93
ECO | 3 | 0.15 | 0.85 | 0.00 | 0.70 | 95.60 | 70.16 | 80.93
The Evaluation Results at the Highest F1 Value When Fusing Features by Direct Weight Assignment with One Aspect's Features as the Unit
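Direct weight assignment at the aspect level can be pictured as a weighted sum of three aspect scores. The page does not say how an aspect score is built from its two features, so the sketch below assumes a simple mean (an assumption). The names `ASPECTS` and `fuse_by_aspect` are hypothetical; the feature values are row 1 of the LIS similarity table and the weights are row 1 of the LIS weight table above.

```python
# Which features belong to which aspect, following the feature summary table.
ASPECTS = {
    "SIIN": ("SIA", "SIN"),
    "SCIN": ("SCA", "SCI"),
    "SSIN": ("STI", "SKW"),
}


def fuse_by_aspect(features, aspect_weights):
    """Weighted sum of aspect scores; each aspect score is the mean of its features.

    Averaging within an aspect is an assumption made for this sketch.
    """
    score = 0.0
    for aspect, members in ASPECTS.items():
        aspect_score = sum(features[m] for m in members) / len(members)
        score += aspect_weights[aspect] * aspect_score
    return score


row = {"SIA": 0.845, "SIN": 1.000, "SCA": 0.612,
       "SCI": 0.000, "STI": 0.204, "SKW": 0.000}
weights = {"SIIN": 0.05, "SCIN": 0.85, "SSIN": 0.10}

fs = fuse_by_aspect(row, weights)
print(round(fs, 4))
```

The resulting number is not comparable to the FS column in the top-10 table, whose values exceed 1 and so must come from a differently scaled fusion.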
The Average Value of Each Weight at the Highest F1 Value When Fusing Features by Direct Weight Assignment with a Single Feature as the Unit
The F1 Value of the Mixed Method Compared with That of Using PCA Only on LIS Test Dataset
The F1 Value of the Mixed Method Compared with That of Using PCA Only on ECO Test Dataset