Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (4): 60-68     https://doi.org/10.11925/infotech.2096-3467.2021.0805
Research Paper
Author Name Disambiguation Based on Heterogeneous Information Network*
Deng Qiping, Chen Weijing, Ji Ling, Zhang Yu’e
Library of University of Electronic Science and Technology of China, Chengdu 611731, China
Abstract

[Objective] This paper makes full use of entity relationship data in academic literature to address author name disambiguation. [Methods] We first extract multiple types of nodes and their relationships from bibliographic records to construct a heterogeneous information network (HIN). We then apply network representation learning to obtain representation vectors for author nodes and use clustering analysis to obtain a preliminary partition. Finally, we merge clusters based on strong rule matching to produce the disambiguation result. [Results] On a dataset constructed from the Web of Science, the proposed method achieves a mean K-Metric of 0.842, a 63.18% improvement over the baseline method; even without strong rule matching, the improvement is still 34.69%. [Limitations] The method requires citation information, which limits its application scenarios. [Conclusions] Learning author node representations from the richer entity relationships of a heterogeneous information network effectively improves author name disambiguation.

Key words: Author Name Disambiguation; Relational Data; Heterogeneous Information Network; Network Representation Learning
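Received: 2021-08-06      Published: 2022-05-12
ZTFLH (Chinese Library Classification):  TP391
Funding: *University of Electronic Science and Technology of China 2021 "Double First-Class" Construction Research Support Program (SYLYJ2021213)
Corresponding author: Deng Qiping, ORCID: 0000-0001-7078-2026     E-mail: dengqp@uestc.edu.cn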
Cite this article:
Deng Qiping, Chen Weijing, Ji Ling, Zhang Yu’e. Author Name Disambiguation Based on Heterogeneous Information Network[J]. Data Analysis and Knowledge Discovery, 2022, 6(4): 60-68.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0805      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I4/60
Fig.1  Example of a heterogeneous bibliographic information network
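
As a rough illustration of the kind of network this figure depicts, the following sketch builds a small typed graph from bibliographic records. The node types (paper, author, venue, cited reference), the record fields, and the helper name `build_hin` are assumptions made for illustration only; this page states just that multiple node types and their relationships are extracted and that citation information is required.

```python
from collections import defaultdict


def build_hin(records):
    """Build an undirected typed graph.

    Each node is a (type, identifier) tuple; edges are stored as adjacency sets.
    The schema used here is hypothetical, not the one from the paper.
    """
    adj = defaultdict(set)

    def link(u, v):
        adj[u].add(v)
        adj[v].add(u)

    for rec in records:
        paper = ("paper", rec["id"])
        adj[paper]  # register the paper node even if it has no links
        for name in rec.get("authors", []):
            link(paper, ("author", name))        # paper-author relation
        if rec.get("venue"):
            link(paper, ("venue", rec["venue"]))  # paper-venue relation
        for ref in rec.get("cites", []):
            link(paper, ("reference", ref))      # paper-citation relation
    return dict(adj)


# Toy records; identifiers are made up for the example.
records = [
    {"id": "p1", "authors": ["Jia Xu", "Qi Hu"], "venue": "J1", "cites": ["r1", "r2"]},
    {"id": "p2", "authors": ["Jia Xu"], "venue": "J2", "cites": ["r2"]},
]
hin = build_hin(records)
```

In this toy graph the two papers are connected through the shared name node and the shared cited reference, which is the kind of relational evidence a representation learning step could then exploit.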
Fig.2  Framework of the author name disambiguation method
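
The last stage of the framework, merging preliminary clusters by strong rule matching, can be sketched with a union-find structure. The rule used here (two clusters are merged when any of their papers share an identical author e-mail address) is a hypothetical stand-in, since this page does not list the actual rules; the function name `merge_clusters` and the input layout are likewise illustrative.

```python
def merge_clusters(labels, strong_keys):
    """Merge preliminary clusters that share a strong feature.

    labels:      paper id -> preliminary cluster label
    strong_keys: paper id -> set of strong features (hypothetical, e.g. e-mails)
    Returns paper id -> representative label of the merged cluster.
    """
    parent = {c: c for c in set(labels.values())}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving keeps trees shallow
            c = parent[c]
        return c

    first_cluster = {}  # strong feature -> cluster in which it was first seen
    for paper, cluster in labels.items():
        for key in strong_keys.get(paper, ()):
            if key in first_cluster:
                a, b = find(first_cluster[key]), find(cluster)
                if a != b:
                    parent[b] = a  # union the two clusters
            else:
                first_cluster[key] = cluster
    return {paper: find(cluster) for paper, cluster in labels.items()}


# Toy input: three papers in three preliminary clusters.
labels = {"p1": 0, "p2": 1, "p3": 2}
strong = {"p1": {"a@x.edu"}, "p2": {"a@x.edu"}, "p3": {"b@y.edu"}}
merged = merge_clusters(labels, strong)
```

Union-find is used so that chains of matches (A matches B, B matches C) collapse into one cluster without repeated pairwise scans.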
Author name      Related papers   Actual authors
Hongbin Liang    179              10
Guorong Chen     267              14
Qi Hu            142              42
Jian Du          149              45
Xi Huang         233              45
Jia Xu           444              87
Table 1  Experimental dataset
                 Proposed method           APV method                Non-SFM method
Author name      ACP    AAP    K-Metric    ACP    AAP    K-Metric    ACP    AAP    K-Metric
Hongbin Liang    0.987  0.911  0.948       0.889  0.237  0.459       0.938  0.379  0.596
Guorong Chen     0.918  0.848  0.882       0.911  0.161  0.383       0.919  0.286  0.513
Qi Hu            0.816  0.915  0.864       0.323  0.343  0.333       0.816  0.915  0.864
Jian Du          0.771  0.969  0.864       0.589  0.836  0.701       0.772  0.910  0.838
Xi Huang         0.821  0.817  0.819       0.758  0.534  0.636       0.688  0.728  0.708
Jia Xu           0.769  0.591  0.674       0.730  0.469  0.585       0.787  0.540  0.652
Mean             0.847  0.842  0.842       0.700  0.430  0.516       0.820  0.626  0.695
Table 2  Comparison of author name disambiguation results
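
The K-Metric values in this table are consistent with the commonly used definition K = sqrt(ACP × AAP), the geometric mean of average cluster purity and average author purity. That definition is assumed here, since this page does not state the formula. The sketch below recomputes a few rows of the proposed method and the relative improvements in mean K-Metric; the function names are illustrative.

```python
import math


def k_metric(acp, aap):
    """Geometric mean of ACP and AAP (assumed definition of the K-Metric)."""
    return math.sqrt(acp * aap)


def relative_gain(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100


# (ACP, AAP) pairs for the proposed method, copied from the table.
rows = {
    "Hongbin Liang": (0.987, 0.911),
    "Guorong Chen": (0.918, 0.848),
    "Jia Xu": (0.769, 0.591),
}
recomputed = {name: round(k_metric(acp, aap), 3) for name, (acp, aap) in rows.items()}

# Mean K-Metric values from the last table row.
gain_full = round(relative_gain(0.842, 0.516), 2)     # proposed method vs. APV
gain_non_sfm = round(relative_gain(0.695, 0.516), 2)  # Non-SFM vs. APV
```

Under this assumption the recomputed values match the tabulated K-Metric column, and the two gains match the 63.18% and 34.69% figures reported in the abstract.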
Fig.3  Effect of representation vector dimension on the performance of different methods