Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (5): 71-80     https://doi.org/10.11925/infotech.2096-3467.2022.0576
Research Paper
Name Disambiguation Based on Similar Features and Relation Graph Optimization*
Cui Huanqing¹,², Yang Junzhu¹, Song Weiqing¹
¹ College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
² State Key Laboratory of High-end Server & Storage Technology, Inspur Group Co., Ltd., Jinan 250014, China
Abstract

[Objective] This paper aims to make full use of the feature information and relation information in academic literature to solve the author name disambiguation problem. [Methods] We propose a name disambiguation method that combines feature information embedding with relation graph optimization. First, we extract document features from textual information and obtain an embedding vector for each document through representation learning. Then, we mine the relation information between documents, analyze the strength of each relation, and construct four relation graphs to optimize each document's embedding vector. Finally, we apply a hierarchical agglomerative clustering algorithm to obtain the disambiguation results. [Results] Experiments on the AMiner-na dataset show that the proposed method reaches an average F1 score of 68.78%, which is 1.81 percentage points higher than the second-best method. [Limitations] The method focuses on the average disambiguation performance across all authors, and its performance on some individual authors still needs improvement. [Conclusions] The proposed method makes full use of the relation information between documents and, combined with feature information, effectively improves author name disambiguation.

Keywords: Name Disambiguation; Feature Extraction; Representation Learning; Relation Extraction; Clustering
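The final step named in the abstract, hierarchical agglomerative clustering over document embeddings, can be illustrated with a minimal sketch. The abstract does not state the linkage criterion, the distance measure, or the stopping rule, so average linkage, cosine distance, the `threshold` parameter, and the toy vectors below are all assumptions made for illustration, not the authors' configuration.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_u * norm_v)


def agglomerative_clusters(vectors, threshold):
    """Average-linkage agglomerative clustering with a distance cut-off.

    Starts with one cluster per vector and repeatedly merges the two
    clusters with the smallest average pairwise distance, stopping when
    that distance exceeds `threshold`. Returns a list of index lists.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters under average linkage.
        (x, y), best = min(
            (((p, q), linkage(clusters[p], clusters[q]))
             for p, q in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > threshold:
            break
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


# Toy (assumed) embeddings for five papers that carry the same author name.
paper_vectors = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0],
]
groups = sorted(agglomerative_clusters(paper_vectors, threshold=0.2))
print(groups)  # [[0, 1], [2, 3], [4]]
```

Each resulting group is read as one real-world author. A distance cut-off is used here instead of a fixed cluster count because the number of distinct authors behind a name is usually unknown in advance; that is a design choice of this sketch only.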
Received: 2022-06-05      Published: 2022-11-09
CLC Number (ZTFLH): TP391; G250
Funding: *This work is one of the research results of a project of the Natural Science Foundation of Shandong Province (ZR2021LZH004)
Corresponding author: Cui Huanqing, ORCID: 0000-0002-9251-680X, E-mail: cuihq@sdust.edu.cn
Cite this article:
Cui Huanqing, Yang Junzhu, Song Weiqing. Name Disambiguation Based on Similar Features and Relation Graph Optimization[J]. Data Analysis and Knowledge Discovery, 2023, 7(5): 71-80.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.0576      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2023/V7/I5/71
Fig.1  Name disambiguation framework
Fig.2  Feature learning framework
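| Document information | Extracted relation | Relation graph |
| --- | --- | --- |
| Authors | Co-author | Co-author graph |
| Title | Similar title | Similar-title graph |
| Abstract | Similar abstract | Similar-abstract graph |
| Keywords | Similar keywords | Similar-keyword graph |
| Publication venue | Shared venue | Shared-venue graph |
| Author affiliation | Shared affiliation | Shared-affiliation graph |
| Year | Same year | Same-year graph |

Table 1  Relation extraction and relation graph construction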
Fig.3  Disambiguation results of different relation graphs
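| Relation | Relation graph | Description |
| --- | --- | --- |
| Co-author | Co-author graph G_n | The documents share a collaborating author |
| Shared affiliation | Shared-affiliation graph G_o | The documents' authors share an affiliation |
| Shared venue | Shared-venue graph G_u | The two documents were published in the same venue |
| Similar technical terms | Similar-term graph G_m | The two documents contain similar technical terms |

Table 2  Relation types and relation graphs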
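The four relation graphs in Table 2 can be pictured as edge sets over the documents that share an ambiguous name. The sketch below is only an illustration under stated assumptions: the record layout, the field names (`authors`, `orgs`, `venue`, `terms`), the helper functions, and the Jaccard threshold used for "similar technical terms" are all hypothetical, and the page does not describe how the authors measure term similarity or weight the edges.

```python
from collections import defaultdict
from itertools import combinations


def shared_value_graph(papers, field, exclude=frozenset()):
    """Link two papers when their `field` values overlap.

    `papers` maps a paper id to a dict of attribute sets. Values listed in
    `exclude` (e.g. the ambiguous name itself) never create an edge.
    Returns a set of (id_a, id_b) pairs with id_a < id_b.
    """
    index = defaultdict(set)  # attribute value -> ids of papers carrying it
    for pid, record in papers.items():
        for value in record.get(field, set()):
            if value not in exclude:
                index[value].add(pid)
    edges = set()
    for pids in index.values():
        edges.update(combinations(sorted(pids), 2))
    return edges


def similar_term_graph(papers, field, min_jaccard):
    """Link two papers whose term sets have Jaccard similarity >= min_jaccard."""
    edges = set()
    for a, b in combinations(sorted(papers), 2):
        terms_a = papers[a].get(field, set())
        terms_b = papers[b].get(field, set())
        union = terms_a | terms_b
        if union and len(terms_a & terms_b) / len(union) >= min_jaccard:
            edges.add((a, b))
    return edges


# Hypothetical records for papers that all list an author named "Wei Quan".
papers = {
    "p1": {"authors": {"Wei Quan", "A. Li"}, "orgs": {"Univ X"},
           "venue": {"J. Data"}, "terms": {"graph", "embedding", "cluster"}},
    "p2": {"authors": {"Wei Quan", "A. Li"}, "orgs": {"Univ Y"},
           "venue": {"J. Data"}, "terms": {"graph", "embedding"}},
    "p3": {"authors": {"Wei Quan", "B. Sun"}, "orgs": {"Univ X"},
           "venue": {"Proc. Z"}, "terms": {"protein", "folding"}},
}

g_n = shared_value_graph(papers, "authors", exclude={"Wei Quan"})  # co-author
g_o = shared_value_graph(papers, "orgs")                           # shared affiliation
g_u = shared_value_graph(papers, "venue")                          # shared venue
g_m = similar_term_graph(papers, "terms", min_jaccard=0.5)         # similar terms
print(g_n, g_o, g_u, g_m)
```

The ambiguous name is excluded when building the co-author graph because every candidate document contains it, so it would otherwise connect all documents and carry no signal.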
| Author name | Proposed Pre | Proposed Rec | Proposed F1 | ADES Pre | ADES Rec | ADES F1 | AMiner Pre | AMiner Rec | AMiner F1 | ADNE Pre | ADNE Rec | ADNE F1 | ReLU Pre | ReLU Rec | ReLU F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Xu Xu | 66.08 | 44.45 | 53.15 | 43.97 | 68.61 | 53.59 | 69.99 | 42.29 | 52.73 | 8.64 | 50.01 | 14.73 | 61.21 | 30.39 | 40.61 |
| Rong Yu | 82.79 | 46.00 | 59.14 | 39.48 | 77.58 | 52.33 | 68.74 | 38.15 | 49.07 | 29.06 | 87.13 | 43.59 | 83.03 | 33.32 | 47.56 |
| Yong Tian | 69.75 | 52.43 | 59.86 | 51.51 | 58.13 | 54.62 | 72.68 | 49.55 | 58.93 | 10.51 | 51.06 | 17.44 | 92.67 | 43.07 | 58.81 |
| Lu Han | 49.49 | 25.64 | 33.78 | 25.31 | 51.72 | 33.98 | 63.33 | 31.26 | 41.86 | 15.17 | 53.10 | 23.59 | 77.86 | 11.68 | 20.31 |
| Lin Huang | 67.42 | 32.60 | 43.95 | 52.37 | 84.72 | 64.72 | 80.35 | 31.42 | 45.18 | 10.52 | 38.44 | 16.52 | 99.49 | 22.80 | 37.10 |
| Kexin Xu | 88.47 | 98.98 | 93.42 | 92.09 | 90.78 | 91.43 | 83.79 | 53.32 | 65.17 | 71.61 | 97.71 | 82.65 | 90.93 | 67.86 | 77.72 |
| Wei Quan | 46.90 | 40.44 | 43.43 | 38.92 | 49.47 | 43.56 | 43.71 | 28.66 | 34.62 | 26.37 | 49.52 | 34.42 | 97.26 | 18.33 | 30.85 |
| Tao Deng | 73.00 | 39.38 | 51.16 | 42.01 | 76.06 | 54.12 | 77.05 | 42.66 | 54.92 | 15.99 | 64.97 | 25.66 | 80.55 | 13.72 | 23.44 |
| Hongbin Li | 68.20 | 75.74 | 71.77 | 58.66 | 73.90 | 65.40 | 72.59 | 60.22 | 65.83 | 10.36 | 60.69 | 17.71 | 63.72 | 29.90 | 40.70 |
| Hua Bai | 67.65 | 37.85 | 48.54 | 30.82 | 58.72 | 40.43 | 67.68 | 32.12 | 43.56 | 21.11 | 84.81 | 33.81 | 74.55 | 16.16 | 26.56 |
| Meiling Chen | 50.43 | 87.01 | 63.85 | 44.77 | 70.12 | 54.66 | 69.20 | 44.39 | 54.08 | 21.85 | 73.68 | 33.71 | 100.0 | 7.91 | 14.66 |
| Yanqing Wang | 29.91 | 66.88 | 41.33 | 72.73 | 64.82 | 68.54 | 63.60 | 59.22 | 61.33 | 15.52 | 57.80 | 24.46 | 100.0 | 25.97 | 41.24 |
| Xudong Zhang | 60.61 | 21.48 | 31.72 | 21.16 | 62.87 | 31.66 | 82.38 | 57.50 | 67.72 | 57.97 | 61.42 | 59.64 | 90.56 | 4.59 | 8.73 |
| Qiang Shi | 45.40 | 37.03 | 40.79 | 38.23 | 54.53 | 44.85 | 52.68 | 34.82 | 41.93 | 10.57 | 32.04 | 15.90 | 45.92 | 28.60 | 35.25 |
| Min Zheng | 48.79 | 20.64 | 29.01 | 18.75 | 55.34 | 28.01 | 66.43 | 18.24 | 28.62 | 13.76 | 50.87 | 21.66 | 87.41 | 10.68 | 19.03 |
| Average | 72.47 | 65.45 | 68.78 | 60.19 | 75.48 | 66.97 | 74.25 | 54.07 | 62.58 | 38.66 | 69.20 | 49.61 | 81.47 | 41.02 | 54.56 |

Table 3  Author name disambiguation results (all values in %; Pre = precision, Rec = recall)
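The F1 columns in Table 3 are consistent with the usual definition of F1 as the harmonic mean of precision and recall, F1 = 2 · Pre · Rec / (Pre + Rec). The short check below recomputes F1 for a few rows of the proposed method; the function name `f1_score` and the choice of rows are illustrative, and small differences of about 0.01 can appear because the published Pre and Rec values are already rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (Pre, Rec, reported F1) for the proposed method, copied from Table 3.
rows = {
    "Xu Xu": (66.08, 44.45, 53.15),
    "Rong Yu": (82.79, 46.00, 59.14),
    "Average": (72.47, 65.45, 68.78),
}

computed = {name: round(f1_score(pre, rec), 2) for name, (pre, rec, _) in rows.items()}
for name, (_, _, reported) in rows.items():
    print(f"{name}: computed F1 = {computed[name]:.2f}, reported F1 = {reported:.2f}")
```

The same check on the "Average" row suggests that the reported average F1 is derived from the average precision and recall rather than from averaging the per-author F1 values shown, although the page itself does not state this.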
Fig.4  Comparison experiment on embedding dimension
Fig.5  Comparison experiment on clustering algorithms
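| Stage | Pre/% | Rec/% | F1/% |
| --- | --- | --- | --- |
| Embedding | 38.65 | 25.19 | 30.50 |
| Feature | 71.03 | 49.62 | 58.42 |
| Feature+Graph | 72.47 | 65.45 | 68.78 |

Table 4  Effectiveness analysis of different stages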