Data Analysis and Knowledge Discovery, 2020, Vol. 4, Issue 6: 60-68. https://doi.org/10.11925/infotech.2096-3467.2019.0487
Research Paper
A Deep Learning-based Method of Argumentative Zoning for Research Articles*
Wang Mo, Cui Yunpeng, Chen Li, Li Huan
Agricultural Information Institute of Chinese Academy of Agricultural Sciences, Beijing 100081, China
Key Laboratory of Big Agri-data, Ministry of Agriculture and Rural Areas, Beijing 100081, China

Abstract

[Objective] This study develops an argumentative zoning method that learns sentence representations with a deep learning language representation model and builds a classifier on top of them to improve classification performance. [Methods] We adopted the pre-trained language representation model BERT and augmented its input with a sentence-position feature, then applied transfer learning on an annotated dataset of biochemistry journal articles to obtain sentence-level embeddings. These embeddings were fed into a neural network classifier to perform argumentative zoning. [Results] Experiments on a public dataset showed that, on the eleven-class task, overall accuracy reached 81.3%, which is 29.7 percentage points above the best result from previous studies, with gains for most classes. On the seven-class core task, accuracy reached 85.5%. [Limitations] Owing to constraints of the experimental environment, the pre-trained parameters of the modified-input model came from the original model architecture; how well these transferred parameters suit the new input remains to be explored. [Conclusions] The method substantially outperforms the traditional "feature construction plus shallow machine learning" approach and also improves on the original BERT model, without requiring manual feature engineering. Because the model is not tied to a particular language, it can also be applied to research articles in Chinese.

Keywords: Argumentative Zoning; Deep Learning; Bidirectional Encoder; Neural Networks
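Received: 2019-05-09      Published: 2020-05-18
CLC number: TP391
Funding: *This work is one of the outcomes of the Chinese Academy of Agricultural Sciences Science and Technology Innovation Program project "Association Discovery and Computational Mining of Multi-source Heterogeneous Agricultural Big Data" (CAAS-ASTIP-2016-AII).
Corresponding author: Cui Yunpeng     E-mail: cuiyunpeng@caas.cn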
Cite this article:
Wang Mo, Cui Yunpeng, Chen Li, Li Huan. A Deep Learning-based Method of Argumentative Zoning for Research Articles[J]. Data Analysis and Knowledge Discovery, 2020, 4(6): 60-68.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.0487      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I6/60
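Fig. 1  Example of move-structure classification in a research article
Fig. 2  Architecture of the deep learning model for move classification
Fig. 3  Data input of the proposed model
Fig. 4  Example of the sentence-position vector input
Fig. 5  Structure of the multilayer perceptron move classifier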
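Figs. 3 and 4 concern how a sentence's position in the article enters the model input. This page does not reproduce the encoding itself, so the following is only an illustrative sketch of one common way to derive a position feature: map the sentence's relative position to a discrete bucket. The function name `position_bucket` and the choice of 10 buckets are hypothetical and are not taken from the paper.

```python
def position_bucket(sentence_index: int, total_sentences: int, num_buckets: int = 10) -> int:
    """Map a 0-based sentence index to a bucket in [0, num_buckets).

    Hypothetical helper: the bucketing scheme is an assumption made for
    illustration, not the encoding used by the authors.
    """
    if total_sentences <= 0:
        raise ValueError("total_sentences must be positive")
    if not 0 <= sentence_index < total_sentences:
        raise ValueError("sentence_index must lie in [0, total_sentences)")
    # Integer arithmetic avoids floating-point edge cases at bucket borders.
    return sentence_index * num_buckets // total_sentences


if __name__ == "__main__":
    # First, middle, and last sentence of a 200-sentence article.
    for idx in (0, 100, 199):
        print(idx, position_bucket(idx, 200))
```

A bucket index of this kind could then be looked up in an embedding table and combined with the token inputs, but whether the authors do so in this way is not stated here.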
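| Category | Abbreviation | Chinese term |
|---|---|---|
| Conclusion | CON | 结论 |
| Result | RES | 结果 |
| Goal | GOA | 目标 |
| Method | MET | 方法 |
| Object | OBJ | 对象 |
| Experiment | EXP | 实验 |
| Observation | OBS | 观察 |
| Hypothesis | HYP | 假设 |
| Motivation | MOT | 动机 |
| Background | BAC | 背景 |
| Model | MOD | 模型 |

Table 1  Move categories in the ART Corpus dataset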
Fig. 6  Sample of the data after preprocessing
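| Statistic | CON | RES | GOA | MET | OBJ | EXP | OBS | HYP | MOT | BAC | MOD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Sentences | 3,082 | 7,349 | 548 | 3,740 | 1,189 | 2,822 | 4,643 | 655 | 465 | 6,648 | 3,449 |
| Share (%) | 8.91 | 21.25 | 1.58 | 10.81 | 3.44 | 8.16 | 13.42 | 1.89 | 1.35 | 19.22 | 9.97 |
| Mean words per sentence | 28.10 | 26.70 | 28.46 | 25.07 | 25.16 | 24.33 | 22.81 | 27.33 | 25.39 | 25.50 | 27.16 |

Table 2  Sentence statistics for each move label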
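The share row of Table 2 follows directly from the sentence counts. As a quick check, the sketch below recomputes the shares from the counts copied from the table; the names `SENTENCE_COUNTS` and `label_shares` are introduced here for illustration only.

```python
# Sentence counts per move label, copied from Table 2.
SENTENCE_COUNTS = {
    "CON": 3082, "RES": 7349, "GOA": 548, "MET": 3740,
    "OBJ": 1189, "EXP": 2822, "OBS": 4643, "HYP": 655,
    "MOT": 465, "BAC": 6648, "MOD": 3449,
}


def label_shares(counts):
    """Return each label's share of all sentences, in percent."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


if __name__ == "__main__":
    shares = label_shares(SENTENCE_COUNTS)
    print("total sentences:", sum(SENTENCE_COUNTS.values()))
    # List labels from most to least frequent.
    for label, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{label}: {share:.2f}%")
```

The counts sum to 34,590 sentences, and the recomputed shares agree with the reported row to within 0.01 percentage point (MOT comes out at about 1.34 against the reported 1.35, which may reflect rounding so that the row sums to 100). The distribution is clearly imbalanced: the largest class (RES) has roughly 16 times as many sentences as the smallest (MOT).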
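| Classification task | Batch size | Learning rate | Epochs | Classifier hidden-layer units |
|---|---|---|---|---|
| 11-label | 16 | 2e-5 | 4 | 256 |
| 7-label | 32 | 2e-5 | 4 | 128 |

Table 3  Best hyperparameters for each classification task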
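The settings in Table 3 can be captured as a small configuration object. The sketch below does that and, as a rough sense of scale, counts the trainable parameters of the classifier head. It assumes a single hidden layer and a 768-dimensional sentence embedding (the hidden size of BERT-base models); neither assumption is confirmed on this page, and the names `FineTuneConfig` and `classifier_parameter_count` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters for one classification task, as listed in Table 3."""
    num_labels: int
    batch_size: int
    learning_rate: float
    epochs: int
    hidden_units: int


CONFIGS = {
    "11-label": FineTuneConfig(num_labels=11, batch_size=16,
                               learning_rate=2e-5, epochs=4, hidden_units=256),
    "7-label": FineTuneConfig(num_labels=7, batch_size=32,
                              learning_rate=2e-5, epochs=4, hidden_units=128),
}


def classifier_parameter_count(cfg: FineTuneConfig, embedding_dim: int = 768) -> int:
    """Weights plus biases of a one-hidden-layer MLP head.

    Assumption: one hidden layer and a 768-dim input embedding; the
    actual classifier layout in the paper may differ.
    """
    hidden_layer = embedding_dim * cfg.hidden_units + cfg.hidden_units
    output_layer = cfg.hidden_units * cfg.num_labels + cfg.num_labels
    return hidden_layer + output_layer


if __name__ == "__main__":
    for task, cfg in CONFIGS.items():
        print(task, cfg, classifier_parameter_count(cfg))
```

Under these assumptions the classifier head is small compared with the pre-trained encoder, so most of the model capacity would come from the transferred parameters.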
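| Task | Model | Overall accuracy (%) | Average recall (%) | Average F1 (%) |
|---|---|---|---|---|
| 11-label | LibSVM | 51.6 | 43.0 | 46.3 |
| 11-label | SciBERT | 75.2 | 68.5 | 74.6 |
| 11-label | Improved input | 81.3 | 72.4 | 75.5 |
| 7-label | SciBERT | 80.1 | 76.4 | 78.8 |
| 7-label | Improved input | 85.5 | 80.7 | 83.1 |

Table 4  Comparison of move classification results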
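The accuracy gains quoted in the abstract can be read off Table 4 as differences in percentage points. The short sketch below makes that arithmetic explicit; the accuracy values are copied from the table, while the English model labels and the helper `gain` are introduced here for illustration.

```python
# Overall accuracy (%) copied from Table 4, keyed by (task, model).
ACCURACY = {
    ("11-label", "LibSVM"): 51.6,
    ("11-label", "SciBERT"): 75.2,
    ("11-label", "Improved input"): 81.3,
    ("7-label", "SciBERT"): 80.1,
    ("7-label", "Improved input"): 85.5,
}


def gain(task: str, baseline: str, model: str = "Improved input") -> float:
    """Accuracy difference between model and baseline, in percentage points."""
    return round(ACCURACY[(task, model)] - ACCURACY[(task, baseline)], 1)


if __name__ == "__main__":
    print("11-label vs LibSVM: ", gain("11-label", "LibSVM"))
    print("11-label vs SciBERT:", gain("11-label", "SciBERT"))
    print("7-label vs SciBERT: ", gain("7-label", "SciBERT"))
```

The 29.7 figure in the abstract matches the 11-label difference between the improved-input model and LibSVM, so it is an absolute gain in percentage points rather than a relative improvement.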
Fig. 7  Comparison of evaluation metrics for the proposed model's 11-label move classification results (%)