Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (12): 14-24     https://doi.org/10.11925/infotech.2096-3467.2021.0608
Review
Review of Automatic Citation Classification Based on Machine Learning*
Zhou Zhichao
Health Science Library, Peking University, Beijing 100191, China
Abstract

[Objective] This paper reviews the current application of natural language processing and machine learning techniques to automatic citation classification. [Coverage] We built a search strategy in the Scopus database around keywords such as "citation classification", "citation polarity", "citation function" and "feature selection", and selected 46 representative publications. [Methods] We analyze and assess current research from the perspectives of the citation classification workflow, classification tasks and technical methods, and discuss research trends and challenges. [Results] Research on citation function classification is shifting from multi-class to binary classification. Deep learning models can classify citation sentiment and citation function simultaneously. Automatic citation classification faces problems including corpora confined to a single discipline, disputed definitions of the citation context, and imbalanced classification data. [Limitations] This review is based mainly on the published literature and does not adequately cover classification systems and platforms used in industry. [Conclusions] We recommend developing and refining evaluation methods for the reuse of research data such as code, data and corpora, in order to encourage open sharing; combining citation classification with citation counts to build multi-dimensional evaluation models; and, based on a user's search results, intelligently recommending publications that support or conflict with the research in question for further reading.

Keywords: Automatic Citation Classification; Natural Language Processing; Citation Content Analysis; Text Classification; Machine Learning
Received: 2021-06-20      Published: 2022-01-20
ZTFLH:  G353
Funding: * CALIS National Medical Literature Information Center project (CALIS-2020-01-003)
Corresponding author: Zhou Zhichao, ORCID: 0000-0003-2498-6532     E-mail: zhouzc1987@bjmu.edu.cn
Cite this article:
Zhou Zhichao. Review of Automatic Citation Classification Based on Machine Learning[J]. Data Analysis and Knowledge Discovery, 2021, 5(12): 14-24.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0608      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I12/14
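| Authors and reference | Systematic search | Literature covered | Research focus |
| --- | --- | --- | --- |
| Lu Wei et al.[6] | | 1965-2013, Chinese and English literature | Traces the development of citation content annotation frameworks |
| Wang Wenjuan et al.[12] | | 1977-2015, Chinese and English literature | Summarizes the characteristics of different citation text classification schemes and compares the strengths and weaknesses of different classification methods |
| Yin Li et al.[13] | | 2006-2013, Chinese and English literature | Summarizes research on citation classification based on citation context |
| Wang Jing[14] | | 1964-2017, Chinese and English literature | Traces the development of citation classification methods and techniques |
| Bakhti et al.[15] | | 2006-2013, Chinese and English literature | Summarizes the classifier models used in automatic citation function classification |
| Tahamtan et al.[16] | | 2006-2018, English literature | Reviews survey- and interview-based studies of citation motivation and citing behavior |
| Iqbal et al.[17] | | 2006-2019, English literature | Reviews the techniques used in citation content analysis, citation classification, citation summarization and citation recommendation |

Table 1  Published reviews of citation classification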
Fig.1  Workflow of automatic citation classification
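| Sentiment features extracted | Lexicon / model used | Key finding | Authors and references |
| --- | --- | --- | --- |
| Dependency relations + negation words | SVM, NB, supervised sequence labeling | A four-sentence citation window classifies better than a single citing sentence | Athar[19, 33], Radev et al.[34] |
| Adjectives | Machine learning classifiers | Experiments were run on small samples, which affected classifier performance to some extent | Parthasarathy et al.[35], Sendhilkumar et al.[36] |
| N-grams | NB | Selecting bag-of-words input features based on N-grams gives better classification results than relying on term frequency alone | Sula et al.[30], Xu et al.[21], Kim et al.[37] |
| Sentiment lexicons | SentiWordNet, AFINN and Bing Liu lexicons; NB | The SentiWordNet lexicon performed best | Sendhilkumar et al.[36], Goodarzi et al.[20] |

Table 2  Methods for automatic citation sentiment classification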
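
To make the N-grams row of Table 2 concrete, the following is a minimal sketch of a Naive Bayes sentiment classifier over word n-gram counts, written in plain Python. It is not the implementation used in any of the cited studies: the class name `NaiveBayesSentiment`, the helper `ngrams`, the add-one smoothing and the four toy training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, n_max=2):
    """Return lowercase word n-grams (unigrams up to n_max-grams) of a sentence."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


class NaiveBayesSentiment:
    """Multinomial Naive Bayes with add-one smoothing over n-gram counts."""

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.gram_counts = defaultdict(Counter)
        for sentence, label in zip(sentences, labels):
            self.gram_counts[label].update(ngrams(sentence))
        self.vocab = {g for counts in self.gram_counts.values() for g in counts}
        return self

    def predict(self, sentence):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.gram_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)  # class prior
            for gram in ngrams(sentence):
                if gram in self.vocab:  # ignore n-grams never seen in training
                    score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, hand-written citing sentences (hypothetical, for illustration only).
train_sentences = [
    "we build on the excellent parser of smith",
    "their method achieves strong results",
    "the approach of jones fails on long sentences",
    "this method suffers from poor recall",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayesSentiment().fit(train_sentences, train_labels)
prediction = model.predict("their parser achieves excellent results")
print(prediction)
```

A real experiment would train on an annotated citation corpus and would likely add feature selection over the n-grams, as the table row suggests; the toy data here only shows the mechanics.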
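| Category code | Function description |
| --- | --- |
| Weak | The citing paper points out a weakness of the earlier work |
| CoCoGM | Neutral contrast or comparison of research goals or methods |
| CoCo- | The citing paper's work is presented as superior to the cited work |
| CoCoR0 | Neutral contrast or comparison of research results |
| CoCoXY | The cited work is explicitly contrasted with other work |
| PBas | The citing paper uses the cited work as its basis or starting point |
| PUse | The citing paper uses the cited tools/algorithms/data/definitions unchanged |
| PModi | The citing paper adapts or modifies the cited tools/algorithms/data/definitions |
| PMot | The citation is positive about the approach used or the problem addressed, and serves to motivate the current work |
| PSim | The citing paper's work is similar to the cited work |
| PSup | The citing paper's work and the cited work are compatible or support each other |
| Neut | Neutral description of the cited work; the citation function fits none of the categories above |

Table 3  Teufel's annotation scheme for citation function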
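
As a small illustration of how a 12-way scheme like the one in Table 3 can be collapsed into fewer classes, the sketch below maps each category code to a coarse polarity and counts the result. The codes come from the table; the three-way grouping itself is an assumption inferred from the category descriptions, and the function names and sample labels are hypothetical.

```python
from collections import Counter

# Category codes exactly as listed in Table 3.
TEUFEL_CODES = [
    "Weak", "CoCoGM", "CoCo-", "CoCoR0", "CoCoXY", "PBas",
    "PUse", "PModi", "PMot", "PSim", "PSup", "Neut",
]

# ASSUMPTION: illustrative grouping inferred from the descriptions in Table 3.
NEGATIVE = {"Weak", "CoCo-"}
NEUTRAL = {"CoCoGM", "CoCoR0", "CoCoXY", "Neut"}


def coarse_polarity(code):
    """Map a fine-grained category code to negative / neutral / positive."""
    if code not in TEUFEL_CODES:
        raise ValueError(f"unknown category code: {code!r}")
    if code in NEGATIVE:
        return "negative"
    if code in NEUTRAL:
        return "neutral"
    return "positive"  # the remaining P* codes


def polarity_distribution(codes):
    """Count how many annotated citations fall into each coarse class."""
    return Counter(coarse_polarity(code) for code in codes)


# Hypothetical annotations for six citations.
labels = ["PUse", "Neut", "Weak", "PBas", "CoCoGM", "Neut"]
distribution = polarity_distribution(labels)
print(distribution)
```

Counting the coarse classes this way also makes class imbalance visible before a classifier is trained.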
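| Category | Citation classification feature | Authors and references | Key finding |
| --- | --- | --- | --- |
| Structural features | Citation position | Siddharthan et al.[28], Abu-Jbara et al.[41], Jha et al.[42] | A stacking classifier combining lexical, linguistic and positional features reaches an F1 of up to 51% |
| Structural features | Citation intensity | Abu-Jbara et al.[41], Jha et al.[42], Dong et al.[18] | Structural features such as citation intensity and citation position are crucial for citation function classification |
| Lexical features | Closest verb and adverb | Abu-Jbara et al.[41], Jha et al.[42] | Lexical features such as the closest verb, the closest adverb and topic cues are crucial for citation function classification |
| Lexical features | Active and passive voice | Dong et al.[18] | Reveals syntactic patterns in citing papers: the active voice is typically used when introducing the background of the current work, and the passive voice when describing the tools and methods used |
| Lexical features | Named entity recognition | Jochim et al.[26] | Combining eight new features (lexical, word-level linguistic, linguistic structure, location, frequency, sentiment, self-reference and named entity recognition) achieved good classification results |

Table 4  Structural and lexical features for citation function classification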
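
The two structural features in Table 4 can be approximated with very little code. The sketch below assumes numeric bracketed markers such as `[12]` or `[19, 33]`, treats citation intensity as the number of in-text mentions of a reference, and treats citation position as the relative character offset of each mention. These definitions, the function name `structural_features` and the sample text are assumptions for illustration, not the feature definitions of the cited studies.

```python
import re

# Matches numeric citation markers such as "[7]" or "[19, 33]".
MARKER = re.compile(r"\[([0-9,\s]+)\]")


def structural_features(full_text, ref_number):
    """Return mention count and relative positions for one reference number."""
    positions = []
    for match in MARKER.finditer(full_text):
        numbers = {int(n) for n in match.group(1).split(",") if n.strip()}
        if ref_number in numbers:
            # Relative offset in [0, 1): 0 is the start of the document.
            positions.append(match.start() / max(len(full_text), 1))
    return {
        "citation_count": len(positions),
        "first_relative_position": positions[0] if positions else None,
        "relative_positions": positions,
    }


# Hypothetical three-sentence "paper".
text = (
    "Prior work [1] introduced the task. "
    "We extend [1, 2] with new features. "
    "Results differ from [2]."
)
features = structural_features(text, 1)
print(features["citation_count"], features["first_relative_position"])
```

In practice the position is often expressed as the section in which the citation occurs rather than a character offset; the offset is used here only because it needs no document parsing.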
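| Model / technique | Key finding | Authors and references |
| --- | --- | --- |
| LSTM | An LSTM model fed with 64 citation features distinguishes citation importance with an accuracy of 92.5% | Hassan et al.[31] |
| Attention mechanism | The compositional attention network (CAN) can perform citation function and citation sentiment classification simultaneously | Munkhdalai et al.[32] |
| CNN+RNN | Can perform citation polarity and citation function classification simultaneously | Lauscher et al.[52], Yousif et al.[53] |
| GloVe, Infersent and BERT | Annotated the citation intent of 10 million citation snippets; BERT achieved the highest accuracy | Roman et al.[54] |

Table 5  Applications of deep learning to citation classification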
[1] Hirsch J E. An Index to Quantify an Individual’s Scientific Research Output[J]. Proceedings of the National Academy of Sciences of the United States of America, 2005, 102(46): 16569-16572.
pmid: 16275915
[2] Egghe L. Theory and Practise of the G-Index[J]. Scientometrics, 2006, 69(1): 131-152.
doi: 10.1007/s11192-006-0144-7
[3] Merton R K. The Sociology of Science: Theoretical and Empirical Investigations[M]. Chicago: University of Chicago Press, 1973: 50-62.
[4] Geras A, Siudem G, Gagolewski M. Should We Introduce a Dislike Button for Academic Articles?[J]. Journal of the Association for Information Science and Technology, 2020, 71(2): 221-229.
doi: 10.1002/asi.v71.2
[5] Gilbert G N. Referencing as Persuasion[J]. Social Studies of Science, 1977, 7(1): 113-122.
doi: 10.1177/030631277700700112
[6] Lu Wei, Meng Rui, Liu Xingbang. A Deep Scientific Literature Mining-Oriented Framework for Citation Content Annotation[J]. Journal of Library Science in China, 2014, 40(6): 93-104.
[7] Aljaber B, Martinez D, Stokes N, et al. Improving MeSH Classification of Biomedical Articles Using Citation Contexts[J]. Journal of Biomedical Informatics, 2011, 44(5): 881-896.
doi: 10.1016/j.jbi.2011.05.007 pmid: 21683802
[8] Zhang G, Ding Y, Milojević S. Citation Content Analysis (CCA): A Framework for Syntactic and Semantic Analysis of Citation Content[J]. Journal of the American Society for Information Science and Technology, 2013, 64(7): 1490-1503.
doi: 10.1002/asi.2013.64.issue-7
[9] Cronin B. The Citation Process: The Role and Significance of Citations in Scientific Communication[M]. London: Taylor Graham, 1984: 26-28.
[10] Abu-Jbara A, Radev D. Reference Scope Identification in Citing Sentences [C]//Proceedings of 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2012: 80-90.
[11] Teufel S, Siddharthan A, Tidhar D. Automatic Classification of Citation Function [C]//Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006). 2006: 103-110.
[12] Wang Wenjuan, Ma Jianxia, Chen Chun, et al. A Review of Citation Context Classifications and Implementation Methods[J]. Library and Information Service, 2016, 60(6): 118-127.
[13] Yin Li, Guo Lu, Li Xufen. An Empirical Study on Citation Classification Based on Citation Function and Citation Polarity[J]. Journal of Intelligence, 2018, 37(7): 139-145.
[14] Wang Jing. Research Progress of Citation Content Analysis[J]. Inner Mongolia Science Technology & Economy, 2020(17): 57-59.
[15] Bakhti K, Niu Z D, Nyamawe A S. Semi-Automatic Annotation for Citation Function Classification [C]//Proceedings of 2018 International Conference on Control, Artificial Intelligence, Robotics & Optimization (ICCAIRO). 2018: 43-47.
[16] Tahamtan I, Bornmann L. What do Citation Counts Measure? An Updated Review of Studies on Citations in Scientific Documents Published Between 2006 and 2018[J]. Scientometrics, 2019, 121(3): 1635-1684.
doi: 10.1007/s11192-019-03243-4
[17] Iqbal S, Hassan S U, Aljohani N R, et al. A Decade of In-Text Citation Analysis Based on Natural Language Processing and Machine Learning Techniques: An Overview of Empirical Studies[J]. Scientometrics, 2021, 126(8): 6551-6599.
doi: 10.1007/s11192-021-04055-1
[18] Dong C, Schäfer U. Ensemble-Style Self-Training on Citation Classification [C]//Proceedings of the 5th International Joint Conference on Natural Language Processing. 2011: 623-631.
[19] Athar A. Sentiment Analysis of Citations Using Sentence Structure-Based Features [C]//Proceedings of the ACL-HLT 2011 Student Session. 2011: 81-87.
[20] Goodarzi M, Mahmoudi M T, Zamani R. A Framework for Sentiment Analysis on Schema-Based Research Content via Lexica Analysis [C]//Proceedings of the 7th International Symposium on Telecommunications (IST’2014). 2014: 405-411.
[21] Xu J, Zhang Y, Wu Y, et al. Citation Sentiment Analysis in Clinical Trial Papers[J]. AMIA Annual Symposium Proceedings, 2015: 1334-1341.
[22] Ritchie A, Robertson S, Teufel S. Comparing Citation Contexts for Information Retrieval [C]//Proceedings of the 17th ACM Conference on Information and Knowledge Management. 2008: 213-222.
[23] Athar A, Teufel S. Context-Enhanced Citation Sentiment Detection [C]//Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2012: 597-601.
[24] Teufel S, Siddharthan A, Tidhar D. An Annotation Scheme for Citation Function [C]//Proceedings of the 7th SIGDIAL Workshop on Discourse and Dialogue. 2006: 80-87.
[25] Bertin M, Atanassova I. The Context of Multiple In-Text References and Their Signification[J]. International Journal on Digital Libraries, 2018, 19(2-3): 127-138.
doi: 10.1007/s00799-017-0225-7
[26] Jochim C, Schütze H. Towards a Generic and Flexible Citation Classifier Based on a Faceted Classification Scheme [C]//Proceedings of the 24th International Conference on Computational Linguistics. 2012: 1343-1358.
[27] Valenzuela M, Ha V, Etzioni O. Identifying Meaningful Citations [C]//Proceedings of the 29th AAAI Conference on Artificial Intelligence. 2015: 21-26.
[28] Siddharthan A, Teufel S. Whose Idea was This, and Why does It Matter? Attributing Scientific Work to Citations [C]//Proceedings of Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics. 2007: 316-323.
[29] Hassan S, Iqbal S, Imran M, et al. Mining the Context of Citations in Scientific Publications [C]//Proceedings of the 20th International Conference on Asia Pacific Digital Libraries. 2018: 316-322.
[30] Sula C A, Miller M. Citations, Contexts, and Humanistic Discourse: Toward Automatic Extraction and Classification[J]. Literary and Linguistic Computing, 2014, 29(3): 452-464.
doi: 10.1093/llc/fqu019
[31] Hassan S U, Imran M, Iqbal S, et al. Deep Context of Citations Using Machine-Learning Models in Scholarly Full-Text Articles[J]. Scientometrics, 2018, 117(3): 1645-1662.
doi: 10.1007/s11192-018-2944-y
[32] Munkhdalai T, Lalor J, Yu H. Citation Analysis with Neural Attention Models [C]//Proceedings of the 7th International Workshop on Health Text Mining and Information Analysis. 2016: 69-77.
[33] Athar A, Teufel S. Detection of Implicit Citations for Sentiment Detection [C]//Proceedings of the Workshop on Detecting Structure in Scholarly Discourse. 2012: 18-26.
[34] Radev D R, Muthukrishnan P, Qazvinian V, et al. The ACL Anthology Network Corpus[J]. Language Resources and Evaluation, 2013, 47(4): 919-944.
doi: 10.1007/s10579-012-9211-2
[35] Parthasarathy G, Tomar D C. Sentiment Analyzer: Analysis of Journal Citations from Citation Databases [C]//Proceedings of the 5th International Conference-Confluence the Next Generation Information Technology Summit (Confluence). 2014: 923-928.
[36] Sendhilkumar S, Elakkiya E, Mahalakshmi G S. Citation Semantic Based Approaches to Identify Article Quality [C]//Proceedings of the 3rd International Conference on Computer Science, Engineering & Applications. 2013: 411-420.
[37] Kim I C, Thoma G R. Automated Classification of Author’s Sentiments in Citation Using Machine Learning Techniques: A Preliminary Study [C]//Proceedings of 2015 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology. 2015: 1-7.
[38] Fu X H, Liu W W, Xu Y Y, et al. Combine HowNet Lexicon to Train Phrase Recursive Autoencoder for Sentence-Level Sentiment Analysis[J]. Neurocomputing, 2017, 241: 18-27.
doi: 10.1016/j.neucom.2017.01.079
[39] Huang S, Niu Z D, Shi C Y. Automatic Construction of Domain-Specific Sentiment Lexicon Based on Constrained Label Propagation[J]. Knowledge-Based Systems, 2014, 56: 191-200.
doi: 10.1016/j.knosys.2013.11.009
[40] Teufel S, Moens M. Summarizing Scientific Articles: Experiments with Relevance and Rhetorical Status[J]. Computational Linguistics, 2002, 28(4): 409-445.
doi: 10.1162/089120102762671936
[41] Abu-Jbara A, Ezra J, Radev D. Purpose and Polarity of Citation: Towards NLP-Based Bibliometrics [C]// Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2013: 596-606.
[42] Jha R, Abu-Jbara A, Qazvinian V, et al. NLP-Driven Citation Analysis for Scientometrics[J]. Natural Language Engineering, 2017, 23(1): 93-130.
doi: 10.1017/S1351324915000443
[43] Agarwal S, Choubey L, Yu H. Automatically Classifying the Role of Citations in Biomedical Articles[J]. AMIA Annual Symposium Proceedings, 2010. PMCID:PMC3041379.
[44] Sugiyama K, Kumar T, Kan M Y, et al. Identifying Citing Sentences in Research Papers Using Supervised Learning [C]//Proceedings of 2010 International Conference on Information Retrieval & Knowledge Management (CAMP). 2010: 67-72.
[45] Wang W J, Villavicencio P, Watanabe T. Analysis of Reference Relationships among Research Papers, Based on Citation Context[J]. International Journal on Artificial Intelligence Tools, 2012, 21(2): 1240004.
doi: 10.1142/S0218213012400040
[46] Small H. Characterizing Highly Cited Method and Non-Method Papers Using Citation Contexts: The Role of Uncertainty[J]. Journal of Informetrics, 2018, 12(2): 461-480.
doi: 10.1016/j.joi.2018.03.007
[47] Zhu X D, Turney P, Lemire D, et al. Measuring Academic Influence: Not All Citations are Equal[J]. Journal of the Association for Information Science and Technology, 2015, 66(2): 408-427.
doi: 10.1002/asi.2015.66.issue-2
[48] Pride D, Knoth P. Incidental or Influential? A Decade of Using Text-Mining for Citation Function Classification [C]//Proceedings of the 16th International Society of Scientometrics and Informetrics Conference. 2017: 1357-1367.
[49] Hassan S U, Akram A, Haddawy P. Identifying Important Citations Using Contextual Information from Full Text [C]//Proceedings of the 17th ACM/IEEE Joint Conference on Digital Libraries. 2017: 41-48.
[50] Rao G Z, Huang W H, Feng Z Y, et al. LSTM with Sentence Representations for Document-Level Sentiment Classification[J]. Neurocomputing, 2018, 308: 49-57.
doi: 10.1016/j.neucom.2018.04.045
[51] Wang J, Peng B, Zhang X J. Using a Stacked Residual LSTM Model for Sentiment Intensity Prediction[J]. Neurocomputing, 2018, 322(17): 93-101.
doi: 10.1016/j.neucom.2018.09.049
[52] Lauscher A, Glavaš G, Ponzetto S P, et al. Investigating Convolutional Networks and Domain-Specific Embeddings for Semantic Classification of Citations [C]//Proceedings of the 6th International Workshop on Mining Scientific Publications. 2017: 24-28.
[53] Yousif A, Niu Z D, Chambua J, et al. Multi-Task Learning Model Based on Recurrent Convolutional Neural Networks for Citation Sentiment and Purpose Classification[J]. Neurocomputing, 2019, 335: 195-205.
doi: 10.1016/j.neucom.2019.01.021
[54] Roman M, Shahid A, Khan S, et al. Citation Intent Classification Using Word Embedding[J]. IEEE Access, 2021, 9: 9982-9995.
doi: 10.1109/Access.6287639
[55] Aljohani N R, Fayoumi A, Hassan S U. An In-Text Citation Classification Predictive Model for a Scholarly Search System[J]. Scientometrics, 2021, 126(7): 5509-5529.
doi: 10.1007/s11192-021-03986-z