Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (2/3): 48-59    DOI: 10.11925/infotech.2096-3467.2019.0644
Author Name Disambiguation with Network Embedding
Yu Chuanming1, Zhong Yunci1, Lin Aochen1, An Lu2
1 School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan 430073, China
2 School of Information Management, Wuhan University, Wuhan 430072, China
Abstract

[Objective] This paper aims to resolve author name ambiguity in bibliographic systems and thereby fix the incorrect aggregation of documents that it causes. [Methods] We build an author network, a document network, and an author-document network from structured bibliographic data, then combine several network embedding methods to obtain document node representations. Using these representations as features, we apply unsupervised hierarchical agglomerative clustering to partition the documents by their true authors. [Results] In empirical studies on three datasets (ArnetMiner, CiteSeerX and DBLP), the method still performs well when the network is sparse, and its Macro-F1 improves on the second-best model by up to 6%. [Limitations] We only studied author name disambiguation in English-language settings. [Conclusions] Methods based on network embedding can effectively disambiguate author names. The results are relevant to improving scientific collaboration recommendation, citation recommendation, and research on knowledge networks.

Keywords: Network Embedding; Heterogeneous Network; Author Name Disambiguation; Unsupervised Learning
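The pipeline summarized in the abstract (build the three networks, embed the document nodes, cluster the embeddings) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The record layout, the function names `build_networks` and `agglomerative`, the choice to leave the ambiguous name itself out of the networks, the cosine distance, and the average linkage are all assumptions. The embedding step (for example DeepWalk or LINE) is omitted, so the vectors passed to the clustering function stand in for learned document embeddings.

```python
import math
from collections import defaultdict
from itertools import combinations


def build_networks(records, ambiguous_name):
    """Build co-author, document, and author-document edges for one name.

    `records` is a list of dicts with an 'id' and an 'authors' list
    (a hypothetical layout). The ambiguous name is skipped (an assumption),
    since it appears in every record and carries no signal.
    """
    author_edges = defaultdict(int)   # (author, author) -> joint papers
    author_doc_edges = set()          # (author, document id)
    docs_by_author = defaultdict(set)
    for rec in records:
        coauthors = sorted({a for a in rec["authors"] if a != ambiguous_name})
        for a in coauthors:
            author_doc_edges.add((a, rec["id"]))
            docs_by_author[a].add(rec["id"])
        for a, b in combinations(coauthors, 2):
            author_edges[(a, b)] += 1
    # Two documents are linked once for every co-author they share.
    doc_edges = defaultdict(int)
    for docs in docs_by_author.values():
        for d1, d2 in combinations(sorted(docs), 2):
            doc_edges[(d1, d2)] += 1
    return dict(author_edges), dict(doc_edges), author_doc_edges


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def agglomerative(vectors, n_clusters):
    """Naive average-linkage agglomerative clustering down to n_clusters."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        total = sum(cosine_distance(vectors[i], vectors[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the closest pair; i < j, so deleting j keeps i valid.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy data: co-author names "A", "B", "C" are placeholders.
records = [
    {"id": 1, "authors": ["Lei Wang", "A", "B"]},
    {"id": 2, "authors": ["Lei Wang", "A"]},
    {"id": 3, "authors": ["Lei Wang", "C"]},
]
author_edges, doc_edges, author_doc_edges = build_networks(records, "Lei Wang")

# Placeholder vectors standing in for learned document embeddings.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
clusters = agglomerative(toy_embeddings, n_clusters=2)
```

In practice the number of clusters (the number of true authors behind a name) is unknown and has to be estimated or replaced by a distance threshold; the fixed `n_clusters` here only keeps the sketch short.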
Received: 2019-06-11
CLC Number: TP391
Corresponding author: Yu Chuanming
Cite this article:
Yu Chuanming, Zhong Yunci, Lin Aochen, An Lu. Author Name Disambiguation with Network Embedding[J]. Data Analysis and Knowledge Discovery, 2020, 4(2/3): 48-59. DOI: 10.11925/infotech.2096-3467.2019.0644.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.0644
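Figure 1. Framework for author name disambiguation based on network embedding
Figure 2. Network representation model of authors and documents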
XML tag         Metadata
<title>         Trust Mechanism in Distributed Access Control Model of P2P Networks
<authors>       Lei Wang,Yanqin Zhu,Lanfang Jin,Xizhao Luo
<label>         0
<id>            4944
<jconf>         ACIS-ICIS
<year>          2008
<organization>  null
Table 1. Sample experimental data
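A record like the one in Table 1 can be read into a dictionary with a few lines of Python. The sketch below assumes each field sits on its own line as `<tag> value`, optionally followed by a closing tag; the actual file layout of the dataset may differ. The helper name `parse_record` and the type conversions (integers for `label`, `id` and `year`, `None` for `null`) are illustrative choices, and reading `label` as the ground-truth author index is likewise an assumption.

```python
import re

SAMPLE = """\
<title> Trust Mechanism in Distributed Access Control Model of P2P Networks
<authors> Lei Wang,Yanqin Zhu,Lanfang Jin,Xizhao Luo
<label> 0
<id> 4944
<jconf> ACIS-ICIS
<year> 2008
<organization> null
"""

# Matches "<tag> value", tolerating an optional closing "</tag>".
FIELD = re.compile(r"<(\w+)>\s*(.*?)\s*(?:</\1>)?$")
INT_FIELDS = {"label", "id", "year"}


def parse_record(text):
    """Parse one tagged record into a dict (hypothetical helper)."""
    record = {}
    for line in text.splitlines():
        match = FIELD.match(line.strip())
        if not match:
            continue  # skip blank or unrecognized lines
        tag, value = match.groups()
        if value == "null":
            record[tag] = None
        elif tag == "authors":
            record[tag] = [name.strip() for name in value.split(",")]
        elif tag in INT_FIELDS:
            record[tag] = int(value)
        else:
            record[tag] = value
    return record


record = parse_record(SAMPLE)
print(record["authors"])
```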
Name        AuthorList  AuthorList-NMF  NDNE   ADNE   Proposed method
Lei Wang    23.09       20.04           76.97  28.39  78.64
Jing Zhang  24.58       25.37           73.48  49.56  77.04
Yu Zhang    27.98       17.51           60.28  19.24  55.86
Bin Li      25.82       19.86           80.34  42.11  78.30
Yang Wang   19.01       18.70           53.06  21.07  54.42
Hao Wang    17.23       9.15            54.81  30.67  50.49
Wei Xu      24.57       18.81           66.46  25.05  72.58
Bo Liu      19.24       25.66           86.71  19.05  79.65
Gang Chen   25.79       9.77            63.07  28.09  67.99
Lei Chen    21.13       11.77           60.37  28.96  60.67
Table 2. Author name disambiguation results (Macro-F1) on the ArnetMiner dataset
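To relate Table 2 to the improvement quoted in the abstract, the per-method averages and the per-name gain of the proposed method over the strongest baseline can be computed directly from the table. The snippet below is plain arithmetic on the published numbers. The label `Proposed` and the helper names are illustrative, and reading "improvement over the second-best model" as a per-name difference in Macro-F1 points is an assumption.

```python
METHODS = ["AuthorList", "AuthorList-NMF", "NDNE", "ADNE", "Proposed"]

# Macro-F1 values copied from Table 2 (ArnetMiner), in METHODS order.
TABLE2 = {
    "Lei Wang":   [23.09, 20.04, 76.97, 28.39, 78.64],
    "Jing Zhang": [24.58, 25.37, 73.48, 49.56, 77.04],
    "Yu Zhang":   [27.98, 17.51, 60.28, 19.24, 55.86],
    "Bin Li":     [25.82, 19.86, 80.34, 42.11, 78.30],
    "Yang Wang":  [19.01, 18.70, 53.06, 21.07, 54.42],
    "Hao Wang":   [17.23, 9.15, 54.81, 30.67, 50.49],
    "Wei Xu":     [24.57, 18.81, 66.46, 25.05, 72.58],
    "Bo Liu":     [19.24, 25.66, 86.71, 19.05, 79.65],
    "Gang Chen":  [25.79, 9.77, 63.07, 28.09, 67.99],
    "Lei Chen":   [21.13, 11.77, 60.37, 28.96, 60.67],
}


def method_means(table):
    """Average each method's Macro-F1 over all ambiguous names."""
    n = len(table)
    return {m: sum(row[i] for row in table.values()) / n
            for i, m in enumerate(METHODS)}


def gains_over_best_baseline(table):
    """Proposed method minus the best of the four baselines, per name."""
    return {name: row[-1] - max(row[:-1]) for name, row in table.items()}


means = method_means(TABLE2)
gains = gains_over_best_baseline(TABLE2)
best_name = max(gains, key=gains.get)

for method, value in means.items():
    print(f"{method:15s} {value:6.2f}")
print(f"largest gain: {best_name} ({gains[best_name]:+.2f} points)")
```

On these numbers the largest per-name gain is about 6 points, for Wei Xu, which is in line with the "up to 6%" figure in the abstract, while the averages of NDNE and the proposed method over the ten names are nearly identical.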
Name       AuthorList  AuthorList-NMF  NDNE   ADNE   Proposed method
J Lee      6.41        6.25            42.58  6.62   21.12
S Lee      4.94        4.93            39.79  6.02   33.45
Y Chen     9.45        7.20            47.52  10.07  26.98
C Chen     11.20       4.92            35.63  7.89   18.03
J Smith    9.75        8.51            35.81  9.02   24.47
A Gupta    3.93        5.20            41.14  5.73   23.63
J Martin   17.17       13.83           53.98  22.58  41.05
D Johnson  12.91       15.23           28.55  17.85  24.07
A Kumar    25.67       21.78           35.74  17.96  14.33
M Brown    17.80       19.48           46.11  29.04  24.68
Table 3. Author name disambiguation results (Macro-F1) on the CiteSeerX dataset
Name        AuthorList  AuthorList-NMF  NDNE   ADNE   Proposed method
Wei Wang    12.94       2.37            70.30  12.73  29.56
Yi Zhang    24.91       10.89           34.68  39.31  31.98
Jian Zhang  30.43       13.46           33.83  33.13  23.52
Jing Wang   16.67       11.92           77.00  58.33  67.71
Lei Zhang   5.94        9.68            50.54  8.96   19.28
Wei Li      18.94       4.67            42.52  31.45  32.03
Yang Wang   16.07       12.52           39.98  30.68  47.67
Minsoo Kim  17.73       21.08           43.24  33.85  52.34
Rui Wang    32.16       11.04           50.55  25.38  55.83
Jun Sun     17.42       16.97           58.04  24.57  40.63
Table 4. Author name disambiguation results (Macro-F1) on the DBLP dataset
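Network feature                             ArnetMiner  CiteSeerX  DBLP
Average number of documents                 197.9       733.0      141.7
Average number of true authors              61.4        43.2       13.7
Average number of nodes (author network)    323.5       681.0      160.0
Average number of edges (author network)    600.5       1763.4     426.9
Average node degree (author network)        3.7         4.7        5.0
Average number of edges (document network)  783.7       36541.7    2338.4
Average node degree (document network)      8.1         83.1       19.4
Table 5. Network statistics of the three datasets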
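Statistics of the kind listed in Table 5 can be derived from an edge list. The sketch below assumes undirected networks and assumes that the table averages per-name statistics over the ambiguous names in each dataset; the source does not spell out either point. The helper names and the toy networks are hypothetical. Under that reading, the averaged degree need not equal 2 × (average edges) / (average nodes), which the toy example makes visible.

```python
def network_stats(edges):
    """Return (nodes, edges, average degree) of an undirected graph.

    `edges` is an iterable of (u, v) pairs; duplicates and self-loops
    are ignored, and isolated nodes are not represented.
    """
    unique_edges = {frozenset(e) for e in edges if e[0] != e[1]}
    nodes = {n for e in unique_edges for n in e}
    n_nodes, n_edges = len(nodes), len(unique_edges)
    avg_degree = 2 * n_edges / n_nodes if n_nodes else 0.0
    return n_nodes, n_edges, avg_degree


def average_over_names(networks):
    """Average each statistic over the per-name networks."""
    stats = [network_stats(edges) for edges in networks.values()]
    k = len(stats)
    return tuple(sum(s[i] for s in stats) / k for i in range(3))


# Toy document networks for two hypothetical ambiguous names.
toy = {
    "name_a": [("d1", "d2"), ("d2", "d3"), ("d1", "d3")],  # triangle
    "name_b": [("d4", "d5")],                              # single edge
}
avg_nodes, avg_edges, avg_degree = average_over_names(toy)
ratio_of_averages = 2 * avg_edges / avg_nodes

print(avg_nodes, avg_edges, avg_degree, ratio_of_averages)
```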
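Figure 3. Effect of the number of training iterations on model performance
Figure 4. Effect of the embedding dimension on model performance
Figure 5. Effect of the choice of LINE proximity on model performance
Figure 6. Effect of swapping the training targets of DeepWalk and LINE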