Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (5): 10-19     https://doi.org/10.11925/infotech.2096-3467.2021.0189
Review
Review of Studies on Incremental Name Disambiguation
Cao Simeng, Li Chunwang
National Science Library, Chinese Academy of Sciences, Beijing 100190, China; Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
Abstract

[Objective] This paper summarizes and analyzes research progress on incremental author name disambiguation, aiming to provide a reference for related studies. [Coverage] We searched Google Scholar, ACM, IEEE, Elsevier, Springer, CNKI, and VIP with the keyword pairs ("作者" and "名称消歧") and ("author" and "name disambiguation"). After manual screening and a citation-based expansion from seed documents, we obtained 58 relevant publications: 30 that directly discuss incremental disambiguation and 28 other related works. [Methods] We reviewed the development, technical frameworks, and basic principles of incremental disambiguation, and analyzed the field in terms of similarity comparison strategies, author assignment methods, and issues that need attention. [Results] Existing studies emphasize feature selection and representation, similarity calculation, and author assignment methods. Fragment merging, recognition of multiple topics for the same author, and correction of erroneous records need further study. [Limitations] Few publications take incremental author name disambiguation as their direct topic, which limits the support for the findings of this review. [Conclusions] Research on incremental disambiguation should be strengthened by combining traditional feature engineering with deep learning and artificial intelligence techniques, with a focus on the specific problems that arise in incremental disambiguation practice.

Keywords: Author Name Disambiguation    Incremental Disambiguation    Similarity
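Received: 2021-03-01      Published: 2022-06-21
Chinese Library Classification (ZTFLH):  G250
Funding: This work is one of the research outputs of the Chinese Academy of Sciences special project on literature and information capacity building (Y929090401).
Corresponding author: Li Chunwang, ORCID: 0000-0002-6313-6576     E-mail: licw@mail.las.ac.cn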
Cite this article:
Cao Simeng, Li Chunwang. Review of Studies on Incremental Name Disambiguation. Data Analysis and Knowledge Discovery, 2022, 6(5): 10-19.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0189      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I5/10
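| Author assignment method | Similarity comparison strategy | Feature representation | Main methods or authors | Publication year |
|---|---|---|---|---|
| Rule-based | All records | Algebraic model | INDi[8-9], Tang et al.[10], Han et al.[11] | 2011/2020, 2011, 2017 |
| Rule-based | All records | Set-theoretic model + algebraic model | Protasiewicz et al.[12], Chang et al.[13] | 2016, 2020 |
| Rule-based | Author model | Set-theoretic model | CAND[14-15] | 2018/2019 |
| Classification | All records | Algebraic model | SLAND[16], Zhai et al.[17], Tu[18] | 2012, 2019, 2020 |
| Classification | All records | Probabilistic model | Zhang et al.[19-22] | 2016/2017/2017/2019 |
| Classification | Partial records | Algebraic model | CONNA[23] | 2019 |
| Classification | Author model | Algebraic model | Wu[24] | 2020 |
| Classification | Author model | Probabilistic model | Katsurai et al.[25], Zhao et al.[26] | 2016, 2017 |
| Clustering | All records | Algebraic model | Zhou et al.[27-28], Zhang et al.[29] | 2016/2016/2016, 2018 |
| Clustering | All records | Probabilistic model | INC[30-31] | 2015/2017 |
| Clustering | Partial records | Algebraic model | Khabsa[32], Treeratpituk[33], MINDi[34] | 2012/2015, 2014 |
| Clustering | Author model | Probabilistic model | IncAD[35] | 2015 |
| Graph-based | All records | Graph model | Qiao et al.[36] | 2019 |
| Graph-based | Author model | Graph model | Li[37] | 2020 |

Table 1  Comparison of major incremental disambiguation methods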
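
To make the categories in Table 1 more concrete, the following is a minimal sketch of one combination: rule-based assignment with an author-model comparison strategy and a set-theoretic feature representation. It is an illustration only and does not reproduce any of the listed systems. The `AuthorProfile` class, the `assign_incrementally` function, the feature strings, and the `threshold` value of 0.2 are all assumptions introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorProfile:
    """Hypothetical author model: the union of all features seen so far."""
    author_id: int
    features: set = field(default_factory=set)
    records: list = field(default_factory=list)


def jaccard(a: set, b: set) -> float:
    """Set-theoretic similarity: intersection size over union size."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_incrementally(profiles: list, record_id: str, features: set,
                         threshold: float = 0.2) -> int:
    """Assign one new record to the most similar profile, or open a new one.

    The threshold rule is an assumed stand-in for the assignment step;
    real systems use richer rules, classifiers, or clustering.
    """
    best = max(profiles, key=lambda p: jaccard(p.features, features), default=None)
    if best is None or jaccard(best.features, features) < threshold:
        # No existing author is similar enough: treat the record as a new author.
        best = AuthorProfile(author_id=len(profiles))
        profiles.append(best)
    # Update the chosen author model so later records are compared against it.
    best.features |= features
    best.records.append(record_id)
    return best.author_id


profiles = []
a = assign_incrementally(profiles, "r1", {"coauthor:li", "venue:jcdl", "kw:disambiguation"})
b = assign_incrementally(profiles, "r2", {"coauthor:li", "kw:disambiguation", "kw:clustering"})
c = assign_incrementally(profiles, "r3", {"coauthor:wang", "venue:icassp", "kw:radar"})
print(a, b, c)
```

Each new record is compared only against the per-author models rather than against every stored record, which is the trade-off the "Author model" strategy in the table refers to. Because a record that falls below the threshold always opens a new author, a sketch like this can split one author's work into several fragments, which is why the abstract lists fragment merging as an open problem.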
[1] Chen Y B, Jiang Z Y, Gao J L, et al. A Supervised and Distributed Framework for Cold-Start Author Disambiguation in Large-Scale Publications[J]. Neural Computing and Applications, 2021: 1-16.
[2] Hussain I, Asghar S. A Survey of Author Name Disambiguation Techniques: 2010-2016[J]. The Knowledge Engineering Review, 2017, 32: e22.
doi: 10.1017/S0269888917000182
[3] Shen Zhe, Wang Yi, Yao Yifan, et al. Author Name Disambiguation Techniques for Academic Literature: A Review[J]. Data Analysis and Knowledge Discovery, 2020, 4(8): 15-27. (in Chinese)
[4] Ferreira A A, Gonçalves M A, Laender A H F. A Brief Survey of Automatic Methods for Author Name Disambiguation[J]. ACM SIGMOD Record, 2012, 41(2): 15-26.
[5] Delgado A D, Martínez R, Fresno V, et al. A Data Driven Approach for Person Name Disambiguation in Web Search Results[C]// Proceedings of the 25th International Conference on Computational Linguistics. 2014: 301-310.
[6] Khabsa M, Treeratpituk P, Giles C L. Large Scale Author Name Disambiguation in Digital Libraries[C]// Proceedings of the 2014 IEEE International Conference on Big Data. IEEE, 2014: 41-42.
[7] Zha H. Spectral Relaxation for K-means Clustering[C]// Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic. 2001: 1057-1064.
[8] Carvalho A, Ferreira A A, Laender A H F, et al. Incremental Unsupervised Name Disambiguation in Cleaned Digital Libraries[J]. Journal of Information & Data Management, 2011, 2: 289-304.
[9] Ferreira A A, Gonçalves M A, Laender A H F. Automatic Disambiguation of Author Names in Bibliographic Repositories[J]. Synthesis Lectures on Information Concepts, Retrieval, and Services, 2020, 12(1): 1-146.
[10] Tang J, Fong A C M, Wang B, et al. A Unified Probabilistic Framework for Name Disambiguation in Digital Library[J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(6): 975-987.
doi: 10.1109/TKDE.2011.13
[11] Han H Q, Yao C Q, Fu Y, et al. Semantic Fingerprints-Based Author Name Disambiguation in Chinese Documents[J]. Scientometrics, 2017, 111(3): 1879-1896.
doi: 10.1007/s11192-017-2338-6
[12] Protasiewicz J, Dadas S. A Hybrid Knowledge-Based Framework for Author Name Disambiguation[C]// Proceedings of the 2016 IEEE International Conference on Systems, Man, and Cybernetics. IEEE, 2016: 594-600.
[13] Chang Ning, Dou Yongxiang, Xu Wei. Disambiguation of Sci-Tech Literature Authors with Multi-Source Data[J]. Information Science, 2021, 39(6): 108-116. (in Chinese)
[14] Hussain I, Asghar S. Resolving Namesakes Using the Author’s Social Network[J]. Turkish Journal of Electrical Engineering & Computer Sciences, 2018, 26: 554-569.
[15] Hussain I, Asghar S. Incremental Author Name Disambiguation Using Author Profile Models and Self-Citations[J]. Turkish Journal of Electrical Engineering & Computer Sciences, 2019, 27(5): 3665-3681.
[16] Veloso A, Ferreira A A, Gonçalves M A, et al. Cost-Effective On-Demand Associative Author Name Disambiguation[J]. Information Processing & Management, 2012, 48(4): 680-697.
doi: 10.1016/j.ipm.2011.08.005
[17] Zhai Xiaorui, Han Hongqi, Zhang Yunliang, et al. Research on English Author Name Disambiguation Based on Sparse Distributed Representation[J]. Application Research of Computers, 2019, 36(12): 3534-3538. (in Chinese)
[18] Tu Shiwen. A Study on Methods of Author Name Disambiguation in Academic Literature[D]. Shanghai: East China Normal University, 2020. (in Chinese)
[19] Zhang B C, Dundar M, Hasan M A. Bayesian Non-Exhaustive Classification a Case Study: Online Name Disambiguation Using Temporal Record Streams[C]// Proceedings of the 25th ACM International Conference on Information and Knowledge Management. 2016: 1341-1350.
[20] Zhang B C, Dundar M, Hasan M A. Bayesian Non-Exhaustive Classification for Active Online Name Disambiguation[OL]. arXiv Preprint, arXiv: 1708.04531.
[21] Zhang B C. Towards Name Disambiguation: Relational, Streaming, and Privacy-Preserving Text Data[D]. Indiana, USA: Purdue University, 2017.
[22] Zhang B C, Dundar M, Dave V, et al. Dirichlet Process Gaussian Mixture for Active Online Name Disambiguation by Particle Filter[C]// Proceedings of the 2019 ACM/IEEE Joint Conference on Digital Libraries. 2019: 269-278.
[23] Chen B, Zhang J, Tang J, et al. CONNA: Addressing Name Disambiguation on the Fly[J]. IEEE Transactions on Knowledge and Data Engineering, 2020.
doi: 10.1109/TKDE.2020.3021256
[24] Wu Ziming. Research and Application on Big Scholarly Data-Based Key Technique of Academic Search System[D]. Guangzhou: South China University of Technology, 2020. (in Chinese)
[25] Katsurai M, Ohmukai I, Takeda H. Topic Representation of Researchers’ Interests in a Large-Scale Academic Database and Its Application to Author Disambiguation[J]. IEICE Transactions on Information and Systems, 2016, E99.D(4): 1010-1018.
[26] Zhao Z Q, Rollins J, Bai L G, et al. Incremental Author Name Disambiguation for Scientific Citation Data[C]// Proceedings of the 2017 IEEE International Conference on Data Science and Advanced Analytics. 2017: 175-183.
[27] Zhou Jie, Li Bicheng, Tang Yongwang. Incremental Clustering Method Based on Key Evidence and E2LSH for Person Name Disambiguation[J]. Journal of the China Society for Scientific and Technical Information, 2016, 35(7): 714-722. (in Chinese)
[28] Zhou Jie. Research on Named Entity Recognition and Disambiguation Based on Network Semantic Resource[D]. Zhengzhou: PLA Information Engineering University, 2016. (in Chinese)
[29] Zhang Y T, Zhang F J, Yao P R, et al. Name Disambiguation in AMiner: Clustering, Maintenance, and Human in the Loop[C]// Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018: 1002-1011.
[30] Santana A F, Gonçalves M A, Laender A H F, et al. Incremental Author Name Disambiguation by Exploiting Domain-Specific Heuristics[J]. Journal of the Association for Information Science and Technology, 2017, 68(4): 931-945.
doi: 10.1002/asi.23726
[31] Santana A F, Gonçalves M A, Laender A H F, et al. On the Combination of Domain-Specific Heuristics for Author Name Disambiguation: The Nearest Cluster Method[J]. International Journal on Digital Libraries, 2015, 16(3-4): 229-246.
doi: 10.1007/s00799-015-0158-y
[32] Khabsa M, Treeratpituk P, Giles C L. Online Person Name Disambiguation with Constraints[C]// Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries. 2015: 37-46.
[33] Treeratpituk P. Person Name Disambiguation in the Multicultural and Online Setting[D]. Pennsylvania, USA: The Pennsylvania State University, 2012.
[34] Esperidião L, Ferreira A, Laender A, et al. Reducing Fragmentation in Incremental Author Name Disambiguation[J]. Journal of Information and Data Management, 2014, 5: 293-307.
[35] Qian Y N, Zheng Q H, Sakai T, et al. Dynamic Author Name Disambiguation for Growing Digital Libraries[J]. Information Retrieval Journal, 2015, 18(5): 379-412.
doi: 10.1007/s10791-015-9261-3
[36] Qiao Z Y, Du Y, Fu Y J, et al. Unsupervised Author Disambiguation Using Heterogeneous Graph Convolutional Network Embedding[C]// Proceedings of the 2019 IEEE International Conference on Big Data. IEEE, 2019: 910-919.
[37] Li Na. Research and Application on Disambiguating Authors[D]. Shanghai: East China Normal University, 2020. (in Chinese)
[38] Chen Y, Lee S Y M, Huang C R. PolyUHK: A Robust Information Extraction System for Web Personal Names[C]// Proceedings of the 2nd Web People Search Evaluation Workshop. 2009.
[39] Han H, Giles L, Zha H, et al. Two Supervised Learning Approaches for Name Disambiguation in Author Citations[C]// Proceedings of the 2004 Joint ACM/IEEE Conference on Digital Libraries. IEEE, 2004: 296-305.
[40] Aldous D J. Exchangeability and Related Topics[A]//Hennequin P L. École d’Été de Probabilités de Saint-Flour XIII — 1983[M]. Springer, 1985.
[41] Gao Yue, Wang Wenxian, Yang Shuxian. A Document Clustering Algorithm Based on Dirichlet Process Mixture Model[J]. Netinfo Security, 2015(11): 60-65. (in Chinese)
[42] Chen T Q, Guestrin C. XGBoost: A Scalable Tree Boosting System[C]// Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016: 785-794.
[43] Kim K, Rohatgi S, Giles C L. Hybrid Deep Pairwise Classification for Author Name Disambiguation[C]// Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2019: 2369-2372.
[44] Ferreira A A, Veloso A, Gonçalves M A, et al. Effective Self-Training Author Name Disambiguation in Scholarly Digital Libraries[C]// Proceedings of the 10th Annual Joint Conference on Digital Libraries. 2010: 39-48.
[45] Ferreira A A, Veloso A, Gonçalves M A, et al. Self-Training Author Name Disambiguation for Information Scarce Scenarios[J]. Journal of the Association for Information Science and Technology, 2014, 65(6): 1257-1278.
doi: 10.1002/asi.22992
[46] Ferreira A A, Machado T M, Gonçalves M A. Improving Author Name Disambiguation with User Relevance Feedback[J]. Journal of Information and Data Management, 2012, 3(3): 332-347.
[47] Wang J, Berzins K, Hicks D, et al. A Boosted-Trees Method for Name Disambiguation[J]. Scientometrics, 2012, 93(2): 391-411.
doi: 10.1007/s11192-012-0681-1
[48] Elmacioglu E, Tan Y F, Yan S, et al. PSNUS: Web People Name Disambiguation by Simple Clustering with Rich Features[C]// Proceedings of the 4th International Workshop on Semantic Evaluations. 2007: 268-271.
[49] Asharaf S, Murty M N. An Adaptive Rough Fuzzy Single Pass Algorithm for Clustering Large Data Sets[J]. Pattern Recognition, 2003, 36(12): 3015-3018.
doi: 10.1016/S0031-3203(03)00081-5
[50] Zhu J, Wu X C, Lin X Q, et al. A Novel Multiple Layers Name Disambiguation Framework for Digital Libraries Using Dynamic Clustering[J]. Scientometrics, 2018, 114(3): 781-794.
doi: 10.1007/s11192-017-2611-8
[51] Han H, Xu W, Zha H Y, et al. A Hierarchical Naive Bayes Mixture Model for Name Disambiguation in Author Citations[C]// Proceedings of the 2005 ACM Symposium on Applied Computing. 2005: 1065-1069.
[52] Bhattacharya I, Getoor L. A Latent Dirichlet Model for Unsupervised Entity Resolution[C]// Proceedings of the 2006 SIAM International Conference on Data Mining. 2006: 47-58.
[53] Kim K, Khabsa M, Giles C L. Inventor Name Disambiguation for a Patent Database Using a Random Forest and DBSCAN[C]// Proceedings of the 2016 IEEE/ACM Joint Conference on Digital Libraries. IEEE, 2016: 269-270.
[54] Chen P Y, Choudhury S, Hero A O. Multi-Centrality Graph Spectral Decompositions and Their Application to Cyber Intrusion Detection[C]// Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2016: 4553-4557.
[55] Malin B. Unsupervised Name Disambiguation via Social Network Similarity[C]// Proceedings of the 2005 SIAM Workshop on Link Analysis, Counterterrorism, and Security. 2005: 93-102.
[56] Ma X, Wang R R, Zhang Y. Author Name Disambiguation in Heterogeneous Academic Networks[C]// Proceedings of the 16th International Conference on Web Information Systems and Applications. 2019: 126-137.
[57] Tran H N, Huynh T, Do T. Author Name Disambiguation by Using Deep Neural Network[C]// Proceedings of the 6th Asian Conference on Intelligent Information and Database Systems. 2014: 123-132.
[58] Wagstaff K, Cardie C. Clustering with Instance-Level Constraints[C]// Proceedings of the 7th International Conference on Machine Learning. 2000: 1103-1110.