Disambiguating Author Names with Embedding Heterogeneous Information and Attentive RNN Clustering Parameters
Wang Ruolin1, Niu Zhendong1,2, Lin Qika3, Zhu Yifan1, Qiu Ping1, Lu Hao4, Liu Donglei1
1 School of Computer, Beijing Institute of Technology, Beijing 100081, China
2 Beijing Institute of Technology Library, Beijing 100081, China
3 School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China
4 Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
[Objective] This paper proposes a name disambiguation method for scientific literature that distinguishes scholars who share the same name. Existing solutions rely on document feature extraction or on the relationships between documents and co-authors, and therefore lose higher-order attributes. [Methods] First, we established a unified feature extraction framework, the Paper Embedding Network (PaperEmbNet), which combines content and relationship information to build an academic heterogeneous information network for each author. Then, we designed a Clustering Parameters Method based on an Attentive Recurrent Neural Network (AR4CPM) to estimate the number of clusters directly. Finally, we used the hierarchical agglomerative clustering (HAC) algorithm to disambiguate author names, with the predicted number as the preset parameter. [Results] On the AMiner-AND dataset, the macro-F1 score of the proposed model was up to 4.75% higher than that of the second-best model, and the average training time was 5-10 minutes shorter than that of the existing baselines. [Limitations] The proposed method still needs to be evaluated in multilingual environments. [Conclusions] The proposed approach can effectively perform name disambiguation tasks.
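The final clustering step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the linkage criterion or distance measure, so average linkage and Euclidean distance are assumed here, and the toy `embeddings` and `predicted_k` are hypothetical stand-ins for the PaperEmbNet output and the AR4CPM estimate.

```python
from itertools import combinations
from math import dist


def hac(points, n_clusters):
    """Average-linkage agglomerative clustering, stopped at n_clusters groups."""
    # Start with every paper in its own cluster (clusters hold point indices).
    clusters = [[i] for i in range(len(points))]

    def avg_distance(a, b):
        # Mean Euclidean distance over all cross-cluster pairs (assumed metric).
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge the second into the first.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg_distance(clusters[p[0]], clusters[p[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe: combinations guarantees x < y

    # Convert cluster membership into one label per paper.
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical 2-D paper embeddings for one ambiguous name (toy values).
embeddings = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 4.9), (9.0, 0.0)]
predicted_k = 3  # hypothetical stand-in for the cluster count estimated upstream
labels = hac(embeddings, predicted_k)
print(labels)
```

Because the number of clusters is passed in as a preset parameter, the merge loop needs no distance threshold or stopping heuristic, which is the role the predicted cluster number plays in the described pipeline.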