Data Analysis and Knowledge Discovery  2024, Vol. 8 Issue (2): 74-83    DOI: 10.11925/infotech.2096-3467.2022.1284
Identifying Moves in Full-Text Chinese Academic Papers
Du Xinyu, Li Ning
Computer School, Beijing Information Science & Technology University, Beijing 100101, China
Abstract  

[Objective] This paper studies move recognition in full-text academic papers, laying a foundation for automatically understanding paper content. Existing research on move recognition in academic papers handles only a small number of coarse-grained moves, and few open datasets exist for move classification. [Methods] Using BERT-based models with multi-stage fine-tuning, we constructed a move classification dataset of academic papers. We then proposed a move recognition model that incorporates section titles to recognize moves at a fine-grained level. [Results] On the 22-class classification task, the overall accuracy of the RoBERTa-wwm-ext model increased by 0.031 to 0.909, and its Micro-F1 improved by 0.022 to 0.837. [Limitations] The constructed corpus contains a small amount of unbalanced data, and the quality of the input paper affects the proposed model's performance. [Conclusions] The proposed model supports automatic understanding of academic papers, research quality evaluation, and semantic content retrieval, all of which play important roles in using scientific and technological literature.

Key words: Academic Papers Understanding; Move Recognition; Pre-trained Model
Received: 04 December 2022      Published: 28 March 2023
ZTFLH: TP391; G350
Fund: National Natural Science Foundation of China (61672105)
Corresponding Author: Du Xinyu, ORCID: 0000-0001-5289-8199, E-mail: duxinyu_0@163.com

Cite this article:

Du Xinyu, Li Ning. Identifying Moves in Full-Text Chinese Academic Papers. Data Analysis and Knowledge Discovery, 2024, 8(2): 74-83.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.1284     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2024/V8/I2/74

Framework for Move Recognition Method Based on Pre-trained Model
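Tag | Example move sentence
Background | In text data mining, text classification is an important research topic that is widely used in Web search, log analysis, information filtering, sentiment analysis, and other fields.
Problem/Gap | However, because text data is high-dimensional and sparse, the effectiveness of autoencoders in text mining still needs improvement.
Purpose | To address the high dimensionality and data sparsity encountered when applying autoencoders to text embedding, and thereby improve their effectiveness in text classification.
Research significance | As a level of analysis above words and sentences, discourse plays a vital role in natural language understanding and natural language generation; the corresponding shallow discourse structure analysis is a task of great significance in natural language processing and is the basis of natural language understanding.
Theoretical basis | Rhetorical Structure Theory (RST) is one of the important theories in discourse structure analysis; a text can be converted into a rhetorical structure tree for analysis, as shown in Figure 1.
Definition | A discourse is a linguistic unit composed of a series of consecutive clauses, sentences, or passages, with an internal hierarchy in which words form sentences, sentences form paragraphs, and paragraphs form the discourse.
Example | The discourse information in Table 2 is an example of a hotel review; the review can be treated as a discourse, and each comment it contains can be regarded as a paragraph within that discourse.
Existing research | In 2004, Hu et al. proposed a number of techniques based on data mining and natural language processing for mining opinion features from product reviews, and then used the feature extraction results to select sentences for generating product review summaries.
Value/Advantage | PRC performs partial residual connections on the channels of the feature map, so the gradient source diverts part of the gradient and the number of gradient combinations across timestamps increases without adding layers; partial residual connections can therefore both prevent vanishing gradients and produce a variety of feature combinations.
Proposed method | For the implicit discourse relation classification task, a method based on the self-attention mechanism and syntactic information is proposed.
Method description | The input of the bidirectional long short-term memory model with self-attention is the same as that of BiLSTM; the key part of self-attention is shown in Equation (6), where the hidden-layer representations produced by the BiLSTM model are re-encoded through the self-attention mechanism to extract higher-level feature representations.
Method selection | The YOLO series [5] strikes a good balance between speed and accuracy and is the most popular object detector in practical applications; this paper therefore chooses TinyYOLO as the baseline detection model, reducing floating-point operations and trainable parameters without sacrificing accuracy in order to meet the resource constraints of in-vehicle edge computing units.
Experiment content | Finally, the above network structure is trained on the ShapeNet dataset, and the trained network model is validated and qualitatively compared with other baseline methods.
Experimental environment | The experimental platform is an Intel5 processor with 16 GB RAM and the Ubuntu 16.04 operating system, using Python 3.7 and TensorFlow 1.14.
Experimental settings | In this experiment, batch is set to 64; learning_rate is the learning rate, set to 0.001; decay is the weight-decay regularization coefficient, set to 0.0005; momentum is the momentum, set to 0.9; ignore_thresh is the IOU threshold in the non-maximum suppression algorithm, set to 0.7 in this experiment.
Data | The experiments use the public 20Newsgroups (20news) dataset.
Result description | On the three datasets MITRestaurantCorpus, MITMovieCorpus, and MITMovietrivialCorpus, the proposed model achieved good results, with maximum F1 values of 78.74%, 7.60%, and 71.54%, respectively.
Result evaluation | Table 3 also shows that when many connections link texts of different categories, the feature matrix becomes noisy, which degrades the performance metrics.
Evaluation metrics | This paper uses the micro-averaged F1 measure (Micro-F1) and the macro-averaged F1 measure (Macro-F1) to evaluate all text classification models.
Conclusion | The experimental results show that the proposed model significantly improves the F1 value of the slot filling task.
Contribution | This paper improves the autoencoder by adding a global adjustment function to the hidden layer, which sparsifies the embedded feature vectors and solves the sparsity problem of text data, thereby improving accuracy in subsequent classification applications.
Future work | Although this paper improves and adjusts the tense features, considerable room for improvement remains; in addition, this work was carried out on an annotated corpus, and future work will consider event factuality analysis on raw corpora.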
Move Tag Set with Examples
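The tag set above can be turned into the label mapping a 22-class classifier needs. The following is a minimal sketch, not the authors' code: the English glosses are translations made for this illustration, and the names `MOVE_TAGS`, `label2id`, and `id2label` are hypothetical.

```python
# Move tags from the tag-set table as (original Chinese label, English gloss).
# The glosses are illustrative translations, not official tag names.
MOVE_TAGS = [
    ("背景", "Background"),
    ("问题不足", "Problem/Gap"),
    ("目的", "Purpose"),
    ("研究意义", "Research significance"),
    ("理论基础", "Theoretical basis"),
    ("定义", "Definition"),
    ("举例说明", "Example"),
    ("已有研究", "Existing research"),
    ("价值优势", "Value/Advantage"),
    ("本文方法", "Proposed method"),
    ("方法描述", "Method description"),
    ("方法选择", "Method selection"),
    ("实验内容", "Experiment content"),
    ("实验环境", "Experimental environment"),
    ("实验设置", "Experimental settings"),
    ("数据", "Data"),
    ("结果描述", "Result description"),
    ("结果评估", "Result evaluation"),
    ("评估指标", "Evaluation metrics"),
    ("结论", "Conclusion"),
    ("贡献", "Contribution"),
    ("未来工作", "Future work"),
]

# Integer ids follow the table order; a classifier head would have one
# output unit per id.
label2id = {zh: index for index, (zh, _) in enumerate(MOVE_TAGS)}
id2label = {index: zh for zh, index in label2id.items()}
gloss = dict(MOVE_TAGS)

print(len(label2id), "classes; id 0 is", gloss[id2label[0]])
```

Keeping the original labels as keys means the mapping still matches an annotated corpus that uses the Chinese tag names.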
Micro-F1 Score for Multi-stage Fine-Tuning
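Fine-tuning stage | Training data size | Precision | Recall | Micro-F1
Initial classification model | 2,490 | 0.656 | 0.601 | 0.627
Stage 1 (data augmentation) | 4,980 | 0.657 | 0.647 | 0.652
Stage 2 | 6,974 | 0.799 | 0.756 | 0.777
Stage 3 | 11,263 | 0.805 | 0.769 | 0.786
Stage 4 | 15,275 | 0.807 | 0.784 | 0.795
Stage 5 | 19,275 | 0.840 | 0.781 | 0.809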
Recognition Result under Multi-stage Fine-Tuning
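The Micro-F1 column can be cross-checked against the precision and recall columns. The sketch below assumes the reported precision and recall are micro-averaged, so that Micro-F1 is their harmonic mean; the function and variable names are hypothetical, and the numbers are copied from the table.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (stage, precision, recall, reported Micro-F1), copied from the table.
STAGES = [
    ("Initial classification model", 0.656, 0.601, 0.627),
    ("Stage 1 (data augmentation)", 0.657, 0.647, 0.652),
    ("Stage 2", 0.799, 0.756, 0.777),
    ("Stage 3", 0.805, 0.769, 0.786),
    ("Stage 4", 0.807, 0.784, 0.795),
    ("Stage 5", 0.840, 0.781, 0.809),
]


def max_abs_deviation(rows) -> float:
    """Largest gap between the recomputed and the reported Micro-F1."""
    return max(
        abs(f1_from_precision_recall(p, r) - reported)
        for _, p, r, reported in rows
    )


for stage, p, r, reported in STAGES:
    recomputed = f1_from_precision_recall(p, r)
    print(f"{stage}: recomputed {recomputed:.4f}, reported {reported:.3f}")
```

Because the table rounds precision and recall to three decimals, a recomputed value can differ from the reported one in the last digit.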
Move Recognition Framework with Section Titles as Input
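This page does not say how the section title is fed to the model. One common option for BERT-style models is a sentence-pair input of the form `[CLS] title [SEP] sentence [SEP]`; the sketch below illustrates that layout only and is an assumption, not the authors' implementation. It also assumes character-level tokens, which approximates how Chinese BERT vocabularies split Chinese text, and takes the length limit of 300 from the max_seq_length value reported in the model parameters. The function name and the example title are hypothetical.

```python
from typing import List

CLS, SEP = "[CLS]", "[SEP]"
NUM_SPECIAL_TOKENS = 3  # one [CLS] and two [SEP]


def build_pair_input(
    section_title: str, sentence: str, max_seq_length: int = 300
) -> List[str]:
    """Lay out a title/sentence pair as [CLS] title [SEP] sentence [SEP].

    Tokens are single characters (an approximation). The title is kept
    first; the sentence is truncated to whatever room remains.
    """
    if max_seq_length < NUM_SPECIAL_TOKENS:
        raise ValueError("max_seq_length must leave room for special tokens")
    budget = max_seq_length - NUM_SPECIAL_TOKENS
    title_tokens = list(section_title)[:budget]
    sentence_tokens = list(sentence)[: budget - len(title_tokens)]
    return [CLS, *title_tokens, SEP, *sentence_tokens, SEP]


# Hypothetical section title ("Experimental settings") paired with a short
# sentence ("In this experiment, batch is set to 64.").
tokens = build_pair_input("实验设置", "本实验中batch设置为64。")
print(" ".join(tokens))
```

In practice a library tokenizer would build this pair and the matching segment ids; the manual version only makes the layout and the truncation order explicit.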
Environment | Configuration
Processor | Intel(R) Xeon(R) Platinum 8255C CPU @2.50GHz
GPU | NVIDIA Tesla V100-SXM2
Operating system | CentOS Linux release 7.8.2003 (Core)
Language | Python
Configuration of Experimental Environment
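Parameter | Value | Description
max_seq_length | 300 | Maximum text length
train_batch_size | 16 | Training batch size
eval_batch_size | 8 | Evaluation batch size
learning_rate | 2e-5 | Learning rate
num_train_epochs | 3 | Number of training epochs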
Model Parameters
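As a rough illustration of what these settings imply, the sketch below stores them in a small configuration object and estimates the number of optimizer steps. The class and function names are hypothetical. Using 19,275 examples (the Stage 5 training size) is an assumption made only for the arithmetic, as is the absence of gradient accumulation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters copied from the model parameter table."""

    max_seq_length: int = 300
    train_batch_size: int = 16
    eval_batch_size: int = 8
    learning_rate: float = 2e-5
    num_train_epochs: int = 3


def total_training_steps(num_examples: int, config: FineTuneConfig) -> int:
    """Optimizer steps over all epochs, one step per batch.

    The last, partially filled batch of each epoch still counts as a step.
    """
    steps_per_epoch = math.ceil(num_examples / config.train_batch_size)
    return steps_per_epoch * config.num_train_epochs


config = FineTuneConfig()
# 19,275 is the Stage 5 training size; treating it as the training set
# for this configuration is an assumption.
print(total_training_steps(19_275, config))
```

A step count like this is what a learning-rate schedule with warmup would be sized against, if one were used.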
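Model | Original sentence: Accuracy | Original sentence: Micro-F1 | With section title text: Accuracy | With section title text: Micro-F1
BERT-wwm-ext | 0.878 | 0.814 | 0.901 | 0.834
RoBERTa-wwm-ext | 0.878 | 0.815 | 0.909 | 0.837
RBT3 | 0.839 | 0.754 | 0.860 | 0.779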
Move Recognition Result for Academic Papers
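The gains quoted in the abstract follow directly from this table. A minimal sketch of the arithmetic, with hypothetical names and the values copied from the table:

```python
# model -> ((accuracy, Micro-F1) on the original sentence,
#           (accuracy, Micro-F1) with section title text added)
RESULTS = {
    "BERT-wwm-ext": ((0.878, 0.814), (0.901, 0.834)),
    "RoBERTa-wwm-ext": ((0.878, 0.815), (0.909, 0.837)),
    "RBT3": ((0.839, 0.754), (0.860, 0.779)),
}


def gains(results):
    """Per-model (accuracy gain, Micro-F1 gain), rounded to three decimals."""
    return {
        model: (
            round(with_title[0] - sentence_only[0], 3),
            round(with_title[1] - sentence_only[1], 3),
        )
        for model, (sentence_only, with_title) in results.items()
    }


for model, (acc_gain, f1_gain) in gains(RESULTS).items():
    print(f"{model}: accuracy +{acc_gain:.3f}, Micro-F1 +{f1_gain:.3f}")
```

All three models improve on both metrics when the section title text is added, and RoBERTa-wwm-ext shows the gains reported in the abstract.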
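Move type | Micro-F1 (original sentence) | Micro-F1 (with section title text) | Move type | Micro-F1 (original sentence) | Micro-F1 (with section title text)
Contribution | 0.695 | 0.695 | Result evaluation | 0.854 | 0.874
Purpose | 0.948 | 0.948 | Experiment content | 0.863 | 0.879
Problem/Gap | 0.914 | 0.934 | Research significance | 0.762 | 0.833
Future work | 0.889 | 0.889 | Result description | 0.886 | 0.898
Conclusion | 0.893 | 0.904 | Experimental environment | 0.928 | 0.923
Existing research | 0.937 | 0.965 | Method description | 0.884 | 0.928
Background | 0.813 | 0.851 | Example | 0.857 | 0.866
Value/Advantage | 0.753 | 0.848 | Proposed method | 0.856 | 0.913
Method selection | 0.545 | 0.615 | Evaluation metrics | 0.906 | 0.861
Theoretical basis | 0 | 0 | Definition | 0.937 | 0.934
Experimental settings | 0.816 | 0.896 | Data | 0.907 | 0.930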
Results of the Different Move Recognition
Move Recognition Sample of Chinese Academic Paper