[Objective] This paper investigates move recognition in full-text academic papers, laying a foundation for automatically understanding paper content. Existing research on move recognition in academic papers handles only a small number of coarse-grained moves, and few open datasets for move classification exist. [Methods] Based on the BERT model, we constructed a move classification dataset of academic papers through multi-stage fine-tuning. We then proposed a move recognition model that incorporates section titles to recognize moves at a fine-grained level. [Results] For the 22-class classification task, the overall accuracy of the RoBERTa-wwm-ext model increased by 0.031 to 0.909, and the Micro-F1 improved by 0.022 to 0.837. [Limitations] The constructed corpus contains a small amount of unbalanced data, and the quality of the papers affects the proposed model's performance. [Conclusions] The proposed model benefits the automatic understanding of academic papers, research quality evaluation, and semantic content retrieval, all of which play important roles in the use of scientific and technological literature.
Du Xinyu, Li Ning. Identifying Moves in Full-Text Chinese Academic Papers[J]. Data Analysis and Knowledge Discovery, 2024, 8(2): 74-83.