Data Analysis and Knowledge Discovery, 2021, Vol. 5, Issue 8: 13-24     https://doi.org/10.11925/infotech.2096-3467.2021.0253
Research Paper
Disambiguating Author Names with Embedding Heterogeneous Information and Attentive RNN Clustering Parameters
Wang Ruolin1, Niu Zhendong1,2, Lin Qika3, Zhu Yifan1, Qiu Ping1, Lu Hao4, Liu Donglei1
1 School of Computer, Beijing Institute of Technology, Beijing 100081, China
2 Beijing Institute of Technology Library, Beijing 100081, China
3 School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China
4 Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Abstract

[Objective] This paper proposes a name disambiguation method for scientific literature that distinguishes scholars who share the same name. Existing solutions rely on text feature extraction or on the relationships between papers and co-authors, and therefore lose higher-order features. [Methods] First, we built a unified feature extraction framework, the Paper Embedding Network (PaperEmbNet), which constructs an academic heterogeneous information network for each author name and fuses content information with relationship information. Then, we designed a clustering parameter prediction method based on an attentive recurrent neural network (AR4CPM) to estimate the number of clusters for each name directly. Finally, we used hierarchical agglomerative clustering (HAC) to disambiguate author names, with the predicted number as the preset parameter. [Results] On the AMiner-AND dataset, the Macro-F1 score of the proposed method was up to 4.75 percentage points higher than that of the second-best model, and the average training time was 5-10 minutes shorter than that of the baselines. [Limitations] The method still needs to be evaluated in multilingual environments. [Conclusions] By constructing an academic heterogeneous information network, the proposed approach captures both the content and the relationship features of papers, and the experiments confirm its effectiveness on the author name disambiguation task.

Keywords: Name Disambiguation; Academic Heterogeneous Information Network; Graph Embedding; Clustering
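The final step described in the abstract, hierarchical agglomerative clustering with the predicted cluster number as a preset parameter, can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the toy 2-D vectors stand in for PaperEmbNet paper embeddings, `predicted_k` stands in for the AR4CPM output, and average linkage with Euclidean distance is an assumption, since the abstract does not state the linkage criterion.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hac(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns one cluster label per input point.
    """
    n_clusters = max(1, min(n_clusters, len(points)))
    clusters = [[i] for i in range(len(points))]  # start from singletons
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest average-linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: average_linkage(clusters[ab[0]], clusters[ab[1]], points),
        )
        clusters[a].extend(clusters[b])
        del clusters[b]  # b > a, so index a stays valid
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical paper embeddings for one ambiguous name (toy values).
embeddings = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0), (9.0, 0.0)]
predicted_k = round(3.2)  # a real-valued prediction rounded to an integer
labels = hac(embeddings, predicted_k)
```

Each distinct value in `labels` then corresponds to one hypothesized real-world author. Passing the cluster number in from a separate predictor, rather than cutting the dendrogram at a distance threshold, is the design choice the abstract describes.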
Received: 2021-03-12      Published: 2021-09-15
Chinese Library Classification (ZTFLH):  TP391
Funding: National Key R&D Program of China (2019YFB1406302)
Corresponding author: Niu Zhendong     ORCID: 0000-0002-0576-7572     E-mail: zniu@bit.edu.cn
Cite this article:
Wang Ruolin, Niu Zhendong, Lin Qika, Zhu Yifan, Qiu Ping, Lu Hao, Liu Donglei. Disambiguating Author Names with Embedding Heterogeneous Information and Attentive RNN Clustering Parameters. Data Analysis and Knowledge Discovery, 2021, 5(8): 13-24.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0253      or      http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I8/13
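Fig.1  Framework of the name disambiguation task
Fig.2  Name disambiguation framework based on heterogeneous information embedding and RNN clustering parameter prediction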
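Symbol    Description
i         Author name i
P^i       Set of papers with author name i
p_j^i     Paper j associated with author name i
I_j^i     Set of content features of paper j
R_j^i     Set of relationship features of paper j
C^i       Clusters for author name i
a_k       A real-world author
Table 1  Notation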
Name            Proposed method       AMiner (global)       GHOST                 Zhang et al.          Rule-based
                Prec   Rec    F1      Prec   Rec    F1      Prec   Rec    F1      Prec   Rec    F1      Prec   Rec    F1
Xu Xu           70.11  56.19  58.39   70.40  46.04  54.95   61.34  21.79  32.15   47.73  39.98  43.51   10.75  97.23  19.35
Rong Yu         64.72  43.88  52.30   47.46  41.05  44.03   92.00  36.41  52.17   66.53  36.90  47.47   30.81  97.79  46.86
Yong Tian       65.74  46.87  54.73   69.69  46.68  54.51   86.94  54.58  67.06   73.18  56.34  63.66   10.37  93.79  18.67
Lu Han          66.10  45.87  54.16   70.01  44.75  53.54   69.72  17.39  27.84   46.05  17.95  25.83   13.66  89.16  23.69
Lin Huang       58.92  41.24  48.52   47.60  41.13  44.13   86.15  17.25  28.74   69.43  33.13  44.86   13.86  99.46  24.33
Kexin Xu        61.04  41.91  49.70   48.47  41.33  44.61   92.90  28.52  43.64   85.74  44.13  58.27   91.45  99.60  95.35
Wei Quan        67.67  44.54  53.72   70.65  45.16  53.56   86.42  27.80  42.07   74.41  33.94  46.62   28.16  93.80  43.32
Tao Deng        74.55  44.64  55.84   74.50  45.53  55.71   73.33  24.50  36.73   55.25  27.93  37.11   16.30  95.16  27.84
Hongbin Li      60.60  41.48  49.25   83.83  53.46  64.72   56.29  29.12  38.39   65.79  52.86  58.62   13.25  96.41  23.30
Hua Bai         65.58  56.37  60.67   78.93  48.29  59.40   83.06  29.54  43.58   54.93  35.97  43.47   25.47  98.51  40.47
Meiling Chen    72.04  52.63  60.82   47.50  41.24  44.15   86.11  23.85  37.35   79.22  25.15  38.18   59.55  82.07  69.02
Yanqing Wang    72.46  48.85  56.68   39.41  58.17  47.16   80.79  40.39  53.86   72.73  42.62  53.74   25.72  62.47  36.44
Xudong Zhang    74.92  48.39  58.80   70.48  45.68  53.82   85.75   7.23  13.34   55.63   8.11  14.16   63.22  17.94  27.95
Qiang Shi       71.26  40.01  51.23   72.43  46.78  55.78   53.72  26.80  35.76   43.33  37.99  40.49   28.79  93.89  44.06
Min Zheng       68.44  47.43  56.03   72.01  47.26  55.44   80.50  15.21  25.58   53.62  17.63  26.54   15.41  98.72  26.66
Avg.            78.17  47.88  59.31   68.40  47.42  54.56   81.62  40.43  50.23   70.22  48.72  57.53   44.94  89.30  53.42
Table 2  Overall name disambiguation results (%)
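Table 2 reports precision, recall, and F1 per ambiguous name, and the abstract summarizes performance as Macro-F1. The sketch below shows one common way to compute such scores for a clustering, using pairwise counts over papers. The source page does not spell out the exact definition, so the pairwise formulation, the function names, and the toy labels are assumptions made for illustration.

```python
from itertools import combinations


def pairwise_prf(true_labels, pred_labels):
    """Pairwise precision, recall, and F1 for the papers of one ambiguous name."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_pred and same_true:
            tp += 1  # pair correctly placed in the same cluster
        elif same_pred:
            fp += 1  # pair merged although the authors differ
        elif same_true:
            fn += 1  # pair split although the author is the same
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_f1(per_name_results):
    """Unweighted mean of the per-name F1 scores."""
    return sum(f1 for _, _, f1 in per_name_results) / len(per_name_results)


# Toy example: four papers, two true authors, one paper wrongly merged.
p, r, f = pairwise_prf([0, 0, 1, 1], [0, 0, 0, 1])
```

Under this formulation, a method that merges everything into one cluster gets high recall and low precision, which is the pattern the rule-based column shows for most names.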
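Fig.3  t-SNE visualization of the embedding space for the name "Wang Shui" in the dataset
Fig.4  Time efficiency of the proposed method compared with the baselines
Fig.5  Feature contribution analysis based on precision, recall, and F1
Fig.6  Effect of the embedding dimension on disambiguation performance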
Name              Actual   Proposed method   AMiner   X-means
Xudong Zhang          69             66.35    55.79         9
Ruijin Liao            6              7.19     3.22        10
Zhifeng Liu           49             45.67    31.88         8
Yongqing Huang         9              8.08     5.26         3
Yongqing Li           30             28.31    39.57        10
Meiling Chen          38             40.25    48.13        12
Xiaoning Zhang        36             35.93    29.30         5
Jiamo Fu               7              7.31     3.78         4
Geng Yang             20             20.90    10.12         5
Zhigang Zeng          18             21.86    10.54         7
Table 3  Cluster size prediction results
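To compare the three estimators in Table 3 at a glance, one option is the mean absolute error between each method's predicted cluster number and the actual value. The sketch below computes it from the ten rows of the table. The choice of mean absolute error is an illustrative assumption and is not a metric reported on this page.

```python
# Rows copied from Table 3: (name, actual, proposed method, AMiner, X-means).
TABLE_3 = [
    ("Xudong Zhang", 69, 66.35, 55.79, 9),
    ("Ruijin Liao", 6, 7.19, 3.22, 10),
    ("Zhifeng Liu", 49, 45.67, 31.88, 8),
    ("Yongqing Huang", 9, 8.08, 5.26, 3),
    ("Yongqing Li", 30, 28.31, 39.57, 10),
    ("Meiling Chen", 38, 40.25, 48.13, 12),
    ("Xiaoning Zhang", 36, 35.93, 29.30, 5),
    ("Jiamo Fu", 7, 7.31, 3.78, 4),
    ("Geng Yang", 20, 20.90, 10.12, 5),
    ("Zhigang Zeng", 18, 21.86, 10.54, 7),
]


def mean_absolute_error(rows, column):
    """Mean |prediction - actual| over all rows; the actual value is in column 1."""
    return sum(abs(row[column] - row[1]) for row in rows) / len(rows)


mae = {
    "proposed": mean_absolute_error(TABLE_3, 2),
    "aminer": mean_absolute_error(TABLE_3, 3),
    "x_means": mean_absolute_error(TABLE_3, 4),
}
best = min(mae, key=mae.get)  # method with the smallest error on these rows
```

Printing `mae` gives one number per method for these ten names only; it is a derived summary of the table and says nothing about names outside it.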