Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (8): 13-24    DOI: 10.11925/infotech.2096-3467.2021.0253
Disambiguating Author Names with Embedding Heterogeneous Information and Attentive RNN Clustering Parameters
Wang Ruolin1, Niu Zhendong1,2, Lin Qika3, Zhu Yifan1, Qiu Ping1, Lu Hao4, Liu Donglei1
1School of Computer, Beijing Institute of Technology, Beijing 100081, China
2Beijing Institute of Technology Library, Beijing 100081, China
3School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China
4Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Abstract  

[Objective] This paper proposes a name disambiguation method for scientific literature that distinguishes scholars who share the same name. Existing solutions rely on either document feature extraction or the relationships between documents and co-authors, and therefore lose higher-order attributes. [Methods] First, we established a unified feature extraction framework, the Paper Embedding Network (PaperEmbNet), which combines content and relationship features to build an academic heterogeneous information network for each author. Then, we designed a clustering parameter method based on an attentive recurrent neural network (AR4CPM) to estimate the number of clusters directly. Finally, we used the hierarchical agglomerative clustering (HAC) algorithm to disambiguate author names, with the predicted number as the preset parameter. [Results] On the AMiner-AND dataset, the macro-F1 score of the proposed model was up to 4.75% higher than that of the second-best model, and its average training time was 5-10 minutes shorter than that of the existing baselines. [Limitations] The method still needs to be evaluated in multilingual environments. [Conclusions] The proposed approach can effectively perform name disambiguation tasks.
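
The final step described above (HAC run with a preset number of clusters) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 2-D toy vectors stand in for PaperEmbNet embeddings, the value 2.9 stands in for an AR4CPM estimate, rounding the estimate to an integer is an assumption, and average linkage is chosen only for concreteness since the abstract does not state the linkage criterion.

```python
import math


def hac(points, n_clusters):
    """Average-linkage agglomerative clustering that stops at n_clusters.

    points: list of equal-length numeric tuples (toy paper embeddings).
    Returns one integer cluster label per point.
    """
    # Start with every paper in its own cluster.
    clusters = [[i] for i in range(len(points))]

    def avg_dist(a, b):
        # Mean Euclidean distance over all cross-cluster pairs.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = avg_dist(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        # Merge the closest pair into the lower-indexed cluster.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical embeddings for five papers sharing one author name.
toy_embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
# Hypothetical real-valued cluster-number estimate, rounded (assumption).
predicted_k = round(2.9)
labels = hac(toy_embeddings, predicted_k)
print(labels)
```

Each resulting label corresponds to one candidate real-world author; papers with the same label are attributed to the same person.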

Key words: Name Disambiguation; Academic Heterogeneous Information Network; Graph Embedding; Clustering
Received: 12 March 2021      Published: 15 September 2021
ZTFLH:  TP391  
Fund: National Key R&D Program of China (2019YFB1406302)
Corresponding Authors: Niu Zhendong ORCID:0000-0002-0576-7572     E-mail: zniu@bit.edu.cn

Cite this article:

Wang Ruolin, Niu Zhendong, Lin Qika, Zhu Yifan, Qiu Ping, Lu Hao, Liu Donglei. Disambiguating Author Names with Embedding Heterogeneous Information and Attentive RNN Clustering Parameters. Data Analysis and Knowledge Discovery, 2021, 5(8): 13-24.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0253     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I8/13

An Illustration of Name Disambiguation Task
The Overview of the Proposed Model
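Symbol    Description
i         Author name i
P^i       Set of papers with author name i
p_j^i     Paper j associated with author name i
I_j^i     Set of content features of paper j
R_j^i     Set of relationship features of paper j
C^i       Clusters for author name i
a_k       A real-world author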
Notations
Name            Proposed method       AMiner global method  GHOST                 Zhang et al.          Rule-based method
                Prec   Rec    F1      Prec   Rec    F1      Prec   Rec    F1      Prec   Rec    F1      Prec   Rec    F1
Xu Xu           70.11  56.19  58.39   70.40  46.04  54.95   61.34  21.79  32.15   47.73  39.98  43.51   10.75  97.23  19.35
Rong Yu         64.72  43.88  52.30   47.46  41.05  44.03   92.00  36.41  52.17   66.53  36.90  47.47   30.81  97.79  46.86
Yong Tian       65.74  46.87  54.73   69.69  46.68  54.51   86.94  54.58  67.06   73.18  56.34  63.66   10.37  93.79  18.67
Lu Han          66.10  45.87  54.16   70.01  44.75  53.54   69.72  17.39  27.84   46.05  17.95  25.83   13.66  89.16  23.69
Lin Huang       58.92  41.24  48.52   47.60  41.13  44.13   86.15  17.25  28.74   69.43  33.13  44.86   13.86  99.46  24.33
Kexin Xu        61.04  41.91  49.70   48.47  41.33  44.61   92.90  28.52  43.64   85.74  44.13  58.27   91.45  99.60  95.35
Wei Quan        67.67  44.54  53.72   70.65  45.16  53.56   86.42  27.80  42.07   74.41  33.94  46.62   28.16  93.80  43.32
Tao Deng        74.55  44.64  55.84   74.50  45.53  55.71   73.33  24.50  36.73   55.25  27.93  37.11   16.30  95.16  27.84
Hongbin Li      60.60  41.48  49.25   83.83  53.46  64.72   56.29  29.12  38.39   65.79  52.86  58.62   13.25  96.41  23.30
Hua Bai         65.58  56.37  60.67   78.93  48.29  59.40   83.06  29.54  43.58   54.93  35.97  43.47   25.47  98.51  40.47
Meiling Chen    72.04  52.63  60.82   47.50  41.24  44.15   86.11  23.85  37.35   79.22  25.15  38.18   59.55  82.07  69.02
Yanqing Wang    72.46  48.85  56.68   39.41  58.17  47.16   80.79  40.39  53.86   72.73  42.62  53.74   25.72  62.47  36.44
Xudong Zhang    74.92  48.39  58.80   70.48  45.68  53.82   85.75   7.23  13.34   55.63   8.11  14.16   63.22  17.94  27.95
Qiang Shi       71.26  40.01  51.23   72.43  46.78  55.78   53.72  26.80  35.76   43.33  37.99  40.49   28.79  93.89  44.06
Min Zheng       68.44  47.43  56.03   72.01  47.26  55.44   80.50  15.21  25.58   53.62  17.63  26.54   15.41  98.72  26.66
Avg.            78.17  47.88  59.31   68.40  47.42  54.56   81.62  40.43  50.23   70.22  48.72  57.53   44.94  89.30  53.42
The Overall Result of Name Disambiguation
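
For readers unfamiliar with how precision, recall and F1 are typically computed for a clustering of papers, the sketch below uses pairwise counting and then macro-averages over author names. This is an assumption about the evaluation protocol: the page does not state the exact definition used, and the function names and toy labels here are hypothetical, not taken from the paper.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Set of index pairs (i, j), i < j, whose papers share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_prf(true_labels, pred_labels):
    """Pairwise precision, recall and F1 for one author name."""
    true_pairs = same_cluster_pairs(true_labels)
    pred_pairs = same_cluster_pairs(pred_labels)
    tp = len(true_pairs & pred_pairs)  # pairs correctly placed together
    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(true_pairs) if true_pairs else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_average(per_name_scores):
    """Unweighted mean of (precision, recall, F1) tuples across names."""
    n = len(per_name_scores)
    return tuple(sum(score[k] for score in per_name_scores) / n for k in range(3))


# Toy example: two ambiguous names with hand-made labels.
name_a = pairwise_prf([0, 0, 0, 1, 1], [0, 0, 1, 1, 1])
name_b = pairwise_prf([0, 1, 1], [5, 7, 7])
print(name_a, name_b, macro_average([name_a, name_b]))
```

Cluster label values themselves do not matter under this scheme; only which papers are grouped together does, which is why `name_b` scores perfectly despite using different label integers.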
t-SNE Visualization of the Embedding Spaces on Name “Wang Shui” in AMiner-AND
Runtime Difference Between PaperEmbNet and Baselines
Plot of Feature Contribution in Terms of Precision, Recall and F1 Score
The Effects of Embedding Dimension on Results
Name            Actual  Proposed method  AMiner  X-means
Xudong Zhang    69      66.35            55.79   9
Ruijin Liao     6       7.19             3.22    10
Zhifeng Liu     49      45.67            31.88   8
Yongqing Huang  9       8.08             5.26    3
Yongqing Li     30      28.31            39.57   10
Meiling Chen    38      40.25            48.13   12
Xiaoning Zhang  36      35.93            29.30   5
Jiamo Fu        7       7.31             3.78    4
Geng Yang       20      20.90            10.12   5
Zhigang Zeng    18      21.86            10.54   7
Results of Clustering Size Prediction
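
One way to summarize the table above is the mean absolute error (MAE) of each method's predicted cluster number against the actual value. The sketch below computes it from the ten rows listed; the MAE itself is not a figure reported on this page, and the choice of MAE as the summary statistic is an assumption made for illustration.

```python
# Values copied from the ten rows of the cluster size prediction table.
actual = [69, 6, 49, 9, 30, 38, 36, 7, 20, 18]
predicted = {
    "Proposed method": [66.35, 7.19, 45.67, 8.08, 28.31, 40.25, 35.93, 7.31, 20.90, 21.86],
    "AMiner": [55.79, 3.22, 31.88, 5.26, 39.57, 48.13, 29.30, 3.78, 10.12, 10.54],
    "X-means": [9, 10, 8, 3, 10, 12, 5, 4, 5, 7],
}


def mean_absolute_error(truth, estimate):
    """Average of |truth - estimate| over paired entries."""
    return sum(abs(t - e) for t, e in zip(truth, estimate)) / len(truth)


mae = {name: mean_absolute_error(actual, est) for name, est in predicted.items()}
for name, value in mae.items():
    print(f"{name}: {value:.2f}")
```

On these ten names the proposed method's estimates are closest to the actual cluster counts, followed by AMiner and then X-means; the comparison covers only the rows shown and says nothing about the rest of the dataset.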
[1] Bekkerman R, McCallum A. Disambiguating Web Appearances of People in a Social Network[C]// Proceedings of the 14th International Conference on World Wide Web. 2005: 463-470.
[2] Hermansson L, Kerola T, Johansson F, et al. Entity Disambiguation in Anonymized Graphs Using Graph Kernels[C]// Proceedings of the 22nd ACM International Conference on Information and Knowledge Management. 2013: 1037-1046.
[3] Kanani P, McCallum A, Pal C. Improving Author Coreference by Resource-bounded Information Gathering from the Web[C]// Proceedings of the 20th International Joint Conference on Artificial Intelligence. 2007: 429-434.
[4] Steorts R C, Ventura S L, Sadinle M, et al. A Comparison of Blocking Methods for Record Linkage[C]// Proceedings of International Conference on Privacy in Statistical Databases. Springer International Publishing, 2014: 253-268.
[5] Yoshida M, Ikeda M, Ono S, et al. Person Name Disambiguation by Bootstrapping[C]// Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2010: 10-17.
[6] Fu Yuan, Zhu Lijun, Han Hongqi. A Survey of Name Disambiguation[J]. Technology Intelligence Engineering, 2016, 2(1):53-58. (in Chinese)
[7] Tang J, Fong A C M, Wang B, et al. A Unified Probabilistic Framework for Name Disambiguation in Digital Library[J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(6):975-987.
doi: 10.1109/TKDE.2011.13
[8] Han H, Giles L, Zha H Y, et al. Two Supervised Learning Approaches for Name Disambiguation in Author Citations[C]// Proceedings of the 4th ACM/IEEE-CS Joint Conference on Digital Libraries. 2004: 296-305.
[9] Sain S R. The Nature of Statistical Learning Theory[J]. Technometrics, 1996, 38(4):409.
[10] Huang J, Ertekin S, Giles C L. Efficient Name Disambiguation for Large-Scale Databases[C]// Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases. 2006: 536-544.
[11] Lee D, On B W, Kang J, et al. Effective and Scalable Solutions for Mixed and Split Citation Problems in Digital Libraries[C]// Proceedings of the 2nd International Workshop on Information Quality in Information Systems. 2005: 69-76.
[12] Zhang B C, Hasan M A. Name Disambiguation in Anonymized Graphs Using Network Embedding[C]// Proceedings of the 2017 ACM Conference on Information and Knowledge Management. 2017: 1239-1248.
[13] Yu Chuanming, Zhong Yunci, Lin Aochen, et al. Author Name Disambiguation with Network Embedding[J]. Data Analysis and Knowledge Discovery, 2020, 4(2/3):48-59. (in Chinese)
[14] Shen Zhe, Wang Yi, Yao Yifan, et al. Author Name Disambiguation Techniques for Academic Literature: A Review[J]. Data Analysis and Knowledge Discovery, 2020, 4(8):15-27. (in Chinese)
[15] Wang H W, Wang R J, Wen C, et al. Author Name Disambiguation on Heterogeneous Information Network with Adversarial Representation Learning[C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2020: 238-245.
[16] Perozzi B, Al-Rfou R, Skiena S. DeepWalk: Online Learning of Social Representations[C]// Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2014: 701-710.
[17] Mikolov T, Chen K, Corrado G, et al. Efficient Estimation of Word Representations in Vector Space[C]// Proceedings of the International Conference on Learning Representations. 2013.
[18] Grover A, Leskovec J. Node2Vec: Scalable Feature Learning for Networks[C]// Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016: 855-864.
[19] Shi C, Li Y T, Zhang J W, et al. A Survey of Heterogeneous Information Network Analysis[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 29(1):17-37.
doi: 10.1109/TKDE.2016.2598561
[20] Chang S Y, Han W, Tang J L, et al. Heterogeneous Network Embedding via Deep Architectures[C]// Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2015: 119-128.
[21] Yun S, Jeong M, Kim R, et al. Graph Transformer Networks[C]// Proceedings of the 33rd Conference on Neural Information Processing Systems. 2019: 11960-11970.
[22] Wang X, Ji H Y, Shi C, et al. Heterogeneous Graph Attention Network[C]// Proceedings of the 2019 International Conference on World Wide Web. 2019: 2022-2032.
[23] Shi C, Hu B B, Zhao W X, et al. Heterogeneous Information Network Embedding for Recommendation[J]. IEEE Transactions on Knowledge and Data Engineering, 2019, 31(2):357-370.
doi: 10.1109/TKDE.2018.2833443
[24] Le Q, Mikolov T. Distributed Representations of Sentences and Documents[C]// Proceedings of the 31st International Conference on International Conference on Machine Learning. 2014: 1188-1196.
[25] Tang J, Qu M, Wang M Z, et al. LINE: Large-scale Information Network Embedding[C]// Proceedings of the 24th International Conference on World Wide Web. 2015: 1067-1077.
[26] Tenenbaum J B, Silva V D, Langford J C. A Global Geometric Framework for Nonlinear Dimensionality Reduction[J]. Science, 2000, 290(5500):2319-2323.
pmid: 11125149
[27] Belkin M, Niyogi P. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering[C]// Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic. 2001: 585-591.
[28] Pelleg D, Moore A W. X-Means: Extending K-Means with Efficient Estimation of the Number of Clusters[C]// Proceedings of the 17th International Conference on Machine Learning. 2000: 727-734.
[29] Zhang Y T, Zhang F J, Yao P R, et al. Name Disambiguation in AMiner: Clustering, Maintenance, and Human in the Loop[C]// Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2018: 1002-1011.
[30] Cho K, van Merrienboer B, Gulcehre C, et al. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 2014: 1724-1734.
[31] Bahdanau D, Cho K, Bengio Y. Neural Machine Translation by Jointly Learning to Align and Translate[C]// Proceedings of the 3rd International Conference on Learning Representations. 2015.
[32] Fan X M, Wang J Y, Pu X, et al. On Graph-Based Name Disambiguation[J]. Journal of Data and Information Quality, 2011, 2(2):Article No.10.