Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (2/3): 48-59    DOI: 10.11925/infotech.2096-3467.2019.0644
Author Name Disambiguation with Network Embedding
Yu Chuanming1, Zhong Yunci1, Lin Aochen1, An Lu2
1School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan 430073, China
2School of Information Management, Wuhan University, Wuhan 430072, China
Abstract  

[Objective] This paper aims to resolve author name ambiguity in document systems, which causes publications to be aggregated under the wrong author. [Methods] First, we constructed three types of networks from structured document data: an author network, a document network, and an author-document network. Then we combined different network embedding methods to obtain representations of the document nodes. Finally, we clustered the documents with an unsupervised learning model and hierarchical agglomerative clustering. [Results] We conducted empirical studies on datasets from ArnetMiner, CiteSeerX and DBLP. Our method performed well on sparse networks, and the macro-F1 value increased by 6%. [Limitations] We only explored author name disambiguation in English. [Conclusions] The proposed method effectively reduces the ambiguity of author names. It is significant for research on scientific collaboration, citation recommendation, and knowledge networks.
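
The final clustering step named in the Methods can be illustrated with a minimal sketch. It assumes each document already has an embedding vector. The average-linkage rule, the cosine similarity, the fixed target number of clusters, and the toy vectors are all assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def agglomerative_clusters(vectors, n_clusters):
    """Average-linkage agglomerative clustering (assumed linkage rule).

    Starts with one cluster per document and repeatedly merges the two
    clusters whose members are most similar on average.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None  # (average similarity, index a, index b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sims = [
                    cosine(vectors[i], vectors[j])
                    for i in clusters[a]
                    for j in clusters[b]
                ]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


# Hypothetical 2-dimensional document embeddings (made-up values).
doc_vectors = [
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.2, 0.9],
]
clusters = agglomerative_clusters(doc_vectors, n_clusters=2)
print(clusters)
```

Each resulting cluster is read as the set of documents attributed to one real author. In practice the number of real authors is usually unknown, so a stopping threshold on similarity may be used instead of a fixed `n_clusters`.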

Key words: Network Embedding; Heterogeneous Network; Author Name Disambiguation; Unsupervised Learning
Received: 11 June 2019      Published: 26 April 2020
ZTFLH:  TP391  
Corresponding Authors: Chuanming Yu   

Cite this article:

Yu Chuanming, Zhong Yunci, Lin Aochen, An Lu. Author Name Disambiguation with Network Embedding. Data Analysis and Knowledge Discovery, 2020, 4(2/3): 48-59.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.0644     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I2/3/48

Framework of Author Name Disambiguation Based on Network Embedding
Network Representation Model for Authors and Documents
XML Tag Metadata
<title> Trust Mechanism in Distributed Access Control Model of P2P Networks
<authors> Lei Wang,Yanqin Zhu,Lanfang Jin,Xizhao Luo
<label> 0
<id> 4944
<jconf> ACIS-ICIS
<year> 2008
<organization> null
Data Example
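
As a rough illustration of how records like the one above could be turned into the three networks (author, document, author-document), the sketch below uses one plausible construction. The edge definitions are assumptions, not the paper's own: co-authors of the same document are linked, two documents are linked when they share a co-author, and the ambiguous target name is excluded from the author nodes. Only record 4944 comes from the data example; the other two records are hypothetical.

```python
from itertools import combinations

records = [
    # Taken from the data example.
    {"id": 4944, "label": 0,
     "authors": ["Lei Wang", "Yanqin Zhu", "Lanfang Jin", "Xizhao Luo"]},
    # Hypothetical records added for illustration only.
    {"id": 1, "label": 0, "authors": ["Lei Wang", "Yanqin Zhu"]},
    {"id": 2, "label": 1, "authors": ["Lei Wang", "A. Nother"]},
]


def build_networks(records, target):
    """Return edge sets for the author, document and author-document networks."""
    author_edges = set()      # (author, author) pairs co-occurring in a document
    author_doc_edges = set()  # (author, document id) pairs
    coauthors_of = {}         # document id -> set of co-authors
    for rec in records:
        # Assumption: the ambiguous name itself is left out of the author nodes.
        coauthors = sorted(a for a in rec["authors"] if a != target)
        coauthors_of[rec["id"]] = set(coauthors)
        author_edges.update(combinations(coauthors, 2))
        author_doc_edges.update((a, rec["id"]) for a in coauthors)
    # Assumption: two documents are linked if they share at least one co-author.
    doc_edges = {
        (d1, d2)
        for d1, d2 in combinations(sorted(coauthors_of), 2)
        if coauthors_of[d1] & coauthors_of[d2]
    }
    return author_edges, doc_edges, author_doc_edges


author_edges, doc_edges, author_doc_edges = build_networks(records, "Lei Wang")
print(len(author_edges), doc_edges, len(author_doc_edges))
```

In this toy input the documents 1 and 4944 end up connected through the shared co-author, which is the kind of structural signal an embedding method can then exploit. The `label` field is the ground-truth author identity and is used only for evaluation.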
Macro-F1_arnetminer AuthorList AuthorList-NMF NDNE ADNE Proposed method
Lei Wang 23.09 20.04 76.97 28.39 78.64
Jing Zhang 24.58 25.37 73.48 49.56 77.04
Yu Zhang 27.98 17.51 60.28 19.24 55.86
Bin Li 25.82 19.86 80.34 42.11 78.30
Yang Wang 19.01 18.70 53.06 21.07 54.42
Hao Wang 17.23 9.15 54.81 30.67 50.49
Wei Xu 24.57 18.81 66.46 25.05 72.58
Bo Liu 19.24 25.66 86.71 19.05 79.65
Gang Chen 25.79 9.77 63.07 28.09 67.99
Lei Chen 21.13 11.77 60.37 28.96 60.67
Experimental Results on ArnetMiner (%)
Macro-F1_citeseerx AuthorList AuthorList-NMF NDNE ADNE Proposed method
J Lee 6.41 6.25 42.58 6.62 21.12
S Lee 4.94 4.93 39.79 6.02 33.45
Y Chen 9.45 7.20 47.52 10.07 26.98
C Chen 11.20 4.92 35.63 7.89 18.03
J Smith 9.75 8.51 35.81 9.02 24.47
A Gupta 3.93 5.20 41.14 5.73 23.63
J Martin 17.17 13.83 53.98 22.58 41.05
D Johnson 12.91 15.23 28.55 17.85 24.07
A Kumar 25.67 21.78 35.74 17.96 14.33
M Brown 17.80 19.48 46.11 29.04 24.68
Experimental Results on CiteSeerX(%)
Macro-F1_dblp AuthorList AuthorList-NMF NDNE ADNE Proposed method
Wei Wang 12.94 2.37 70.30 12.73 29.56
Yi Zhang 24.91 10.89 34.68 39.31 31.98
Jian Zhang 30.43 13.46 33.83 33.13 23.52
Jing Wang 16.67 11.92 77.00 58.33 67.71
Lei Zhang 5.94 9.68 50.54 8.96 19.28
Wei Li 18.94 4.67 42.52 31.45 32.03
Yang Wang 16.07 12.52 39.98 30.68 47.67
Minsoo Kim 17.73 21.08 43.24 33.85 52.34
Rui Wang 32.16 11.04 50.55 25.38 55.83
Jun Sun 17.42 16.97 58.04 24.57 40.63
Experimental Results on DBLP(%)
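Network feature ArnetMiner CiteSeerX DBLP
Average number of document records 197.9 733.0 141.7
Average number of real authors 61.4 43.2 13.7
Average number of nodes (author network) 323.5 681.0 160.0
Average number of edges (author network) 600.5 1763.4 426.9
Average node degree (author network) 3.7 4.7 5.0
Average number of edges (document network) 783.7 36541.7 2338.4
Average node degree (document network) 8.1 83.1 19.4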
Network Statistical Characteristics of Three Data Sets
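
When reading the average node degree rows, it may help to recall that in an undirected network the average degree equals 2E/N. The sketch below assumes the networks are undirected and that each table entry is a mean over per-name networks; the source does not state how the averages were computed. Under that assumption, the mean of per-network degrees need not equal 2 × (mean edges) / (mean nodes). The node and edge counts in the code are hypothetical.

```python
def average_degree(num_nodes, num_edges):
    # In an undirected network every edge adds one to the degree of two nodes.
    return 2 * num_edges / num_nodes


# Hypothetical (nodes, edges) counts for two per-name networks.
networks = [(4, 3), (10, 20)]

# Average degree computed per network, then averaged across networks.
per_network = [average_degree(n, e) for n, e in networks]
mean_of_degrees = sum(per_network) / len(per_network)

# Average degree computed from the averaged node and edge counts.
mean_nodes = sum(n for n, _ in networks) / len(networks)
mean_edges = sum(e for _, e in networks) / len(networks)
degree_of_means = average_degree(mean_nodes, mean_edges)

print(per_network, mean_of_degrees, degree_of_means)
```

The two summary values differ for these toy counts, which would be one possible explanation for table rows where the reported average degree does not exactly match 2E/N computed from the reported average edge and node counts.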
The Influence of Learning Iteration Numbers on Model Performance
The Influence of Embedding Dimensions on Model Performance
The Influence of LINE Similarity Measures on Model Performance
The Influence of Switching DeepWalk and LINE on Model Performance