Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (6): 60-68     https://doi.org/10.11925/infotech.2096-3467.2019.0487
Research Article
A Deep Learning-based Method of Argumentative Zoning for Research Articles
Wang Mo, Cui Yunpeng, Chen Li, Li Huan
Agricultural Information Institute of Chinese Academy of Agricultural Sciences, Beijing 100081, China
Key Laboratory of Big Agri-data, Ministry of Agriculture and Rural Areas, Beijing 100081, China
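Abstract (translated from the Chinese)

[Objective] To learn sentence representations of research articles with a deep learning language representation model and, on that basis, build a move classification model with better classification performance. [Methods] Using BERT, a pre-trained deep learning language representation model, we refine the model input with each sentence's position in the article and perform transfer learning on an annotated dataset to obtain sentence-level embeddings, which are then fed into a neural network classifier to train the move classification model. [Results] Experiments on a public dataset show that overall accuracy on the 11-class task rose by 29.7 percentage points to 81.3%, and accuracy on the 7-class core-move task reached 85.5%. [Limitations] Constrained by the experimental environment, the pre-trained parameters of the proposed improved-input model come from the original model structure; how well the transferred parameters suit the new model input can be explored further. [Conclusions] The method performs substantially better than the traditional "feature construction + machine learning" classifier approach and somewhat better than the original BERT model. It requires no manual feature construction and is not tied to a particular language, so it can be applied to move classification of Chinese research articles and has considerable practical potential.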

Key words: Move Classification; Deep Learning; Bidirectional Encoder; Neural Networks
Abstract

[Objective] This study develops an argumentative zoning method built on a deep learning language representation model, with the aim of improving classification performance. [Methods] We adopted BERT, a pre-trained deep learning language representation model, and augmented the model input with a sentence position feature. Transfer learning on annotated training data from biochemistry journals yielded sentence-level representations, which were then fed into a neural network classifier to perform argumentative zoning. [Results] In the eleven-class task, the method improved performance for most classes; overall accuracy reached 81.3%, which is 29.7 percentage points above the best result from previous studies. For the seven core classes, the model achieved an accuracy of 85.5%. [Limitations] Owing to constraints of the experimental environment, the refined-input model was initialized with parameters pre-trained on the original model structure, which may limit classification performance; how well these transferred parameters suit the new input remains to be explored. [Conclusions] The proposed method substantially outperforms shallow machine learning approaches, improves on the original BERT model, and avoids manual feature engineering. Because it is not tied to a particular language, it is also applicable to research articles written in Chinese.

Key words: Argumentative Zoning; Deep Learning; Bidirectional Encoder; Neural Networks
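The pipeline outlined in the abstract (a sentence embedding combined with sentence position, fed to a neural network classifier) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: random vectors stand in for the encoder's sentence embeddings, the `relative_position` feature and the ReLU activation are hypothetical choices, and the position is appended to the embedding at the classifier rather than injected into the BERT input as the paper describes. The 768-dimensional embedding is the hidden size of BERT-base-sized encoders; the 256 hidden units and 11 classes follow Tables 3 and 1.

```python
import math
import random

EMBED_DIM = 768      # hidden size of BERT-base-sized encoders (assumed here)
HIDDEN_UNITS = 256   # hidden-layer size listed for the 11-label task
NUM_CLASSES = 11     # number of move labels in the ART Corpus scheme


def relative_position(index, total):
    """Hypothetical position feature: 0.0 for the first sentence, 1.0 for the last."""
    return 0.0 if total < 2 else index / (total - 1)


def dense(vector, weights, bias):
    """Affine layer; `weights` holds one row of input weights per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(embedding, position, params):
    """One-hidden-layer perceptron over [embedding; position] (ReLU is an assumption)."""
    features = embedding + [position]
    hidden = [max(0.0, h) for h in dense(features, params["w1"], params["b1"])]
    return softmax(dense(hidden, params["w2"], params["b2"]))


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
params = {
    "w1": random_matrix(HIDDEN_UNITS, EMBED_DIM + 1, rng),
    "b1": [0.0] * HIDDEN_UNITS,
    "w2": random_matrix(NUM_CLASSES, HIDDEN_UNITS, rng),
    "b2": [0.0] * NUM_CLASSES,
}

num_sentences = 5
# Random vectors stand in for the sentence embeddings an encoder would produce.
embeddings = [[rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
              for _ in range(num_sentences)]
probabilities = [
    classify(emb, relative_position(i, num_sentences), params)
    for i, emb in enumerate(embeddings)
]
predicted_labels = [max(range(NUM_CLASSES), key=p.__getitem__) for p in probabilities]
```

With untrained random weights the predicted labels carry no meaning; the sketch only shows how the shapes fit together, with one probability vector of length 11 per sentence.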
Received: 2019-05-09      Published: 2020-05-18
CLC number (ZTFLH): TP391
Funding: *This work is one of the research outcomes of the project "Association Discovery and Computational Mining of Multi-source Heterogeneous Agricultural Big Data" under the Agricultural Science and Technology Innovation Program of the Chinese Academy of Agricultural Sciences (CAAS-ASTIP-2016-AII).
Corresponding author: Cui Yunpeng     E-mail: cuiyunpeng@caas.cn
Cite this article:
Wang Mo, Cui Yunpeng, Chen Li, Li Huan. A Deep Learning-based Method of Argumentative Zoning for Research Articles. Data Analysis and Knowledge Discovery, 2020, 4(6): 60-68.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.0487      or      http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I6/60
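Fig. 1  Example of move structure classification in a research article
Fig. 2  Structure of the deep learning model for move classification
Fig. 3  Data input of the proposed model
Fig. 4  Example of sentence position vector input
Fig. 5  Structure of the multilayer perceptron move classifier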
| Category | Abbreviation |
|---|---|
| Conclusion | CON |
| Result | RES |
| Goal | GOA |
| Method | MET |
| Object | OBJ |
| Experiment | EXP |
| Observation | OBS |
| Hypothesis | HYP |
| Motivation | MOT |
| Background | BAC |
| Model | MOD |

Table 1  Move classification categories of the ART Corpus dataset
Fig. 6  Sample of the data after preprocessing
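| Statistic | CON | RES | GOA | MET | OBJ | EXP | OBS | HYP | MOT | BAC | MOD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Sentences | 3,082 | 7,349 | 548 | 3,740 | 1,189 | 2,822 | 4,643 | 655 | 465 | 6,648 | 3,449 |
| Share (%) | 8.91 | 21.25 | 1.58 | 10.81 | 3.44 | 8.16 | 13.42 | 1.89 | 1.35 | 19.22 | 9.97 |
| Mean words per sentence | 28.10 | 26.70 | 28.46 | 25.07 | 25.16 | 24.33 | 22.81 | 27.33 | 25.39 | 25.50 | 27.16 |

Table 2  Sentence statistics for each move label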
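The label distribution in Table 2 is strongly skewed, which helps explain why per-class results can differ so much. As a small worked example, the snippet below recomputes each label's share from the sentence counts and derives the ratio between the most and least frequent labels. The counts are copied from Table 2; the variable names are illustrative only.

```python
# Sentence counts per move label, copied from Table 2.
sentence_counts = {
    "CON": 3082, "RES": 7349, "GOA": 548, "MET": 3740, "OBJ": 1189, "EXP": 2822,
    "OBS": 4643, "HYP": 655, "MOT": 465, "BAC": 6648, "MOD": 3449,
}

total = sum(sentence_counts.values())
shares = {label: 100.0 * count / total for label, count in sentence_counts.items()}

# Ratio between the most and least frequent labels, a rough measure of imbalance.
largest = max(sentence_counts, key=sentence_counts.get)
smallest = min(sentence_counts, key=sentence_counts.get)
imbalance_ratio = sentence_counts[largest] / sentence_counts[smallest]

for label, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{label}: {share:5.2f}%")
print(f"total sentences: {total}")
print(f"{largest}/{smallest} ratio: {imbalance_ratio:.1f}")
```

The counts sum to 34,590 sentences, and the most frequent label (RES) has roughly 16 times as many sentences as the least frequent one (MOT).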
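| Classification model | Batch size | Learning rate | Epochs | Classifier hidden-layer units |
|---|---|---|---|---|
| 11-label classification | 16 | 2e-5 | 4 | 256 |
| 7-label classification | 32 | 2e-5 | 4 | 128 |

Table 3  Best model hyperparameters for each classification task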
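For anyone reproducing the setup, the hyperparameters in Table 3 could be captured in a small configuration object along these lines. This is only a sketch: the `FineTuneConfig` class, its field names, and the `steps_per_epoch` helper are hypothetical and do not come from the paper. Only the numeric values are taken from Table 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for the Table 3 hyperparameters."""
    num_labels: int
    batch_size: int
    learning_rate: float
    epochs: int
    hidden_units: int

    def steps_per_epoch(self, num_training_sentences: int) -> int:
        """Optimizer steps per epoch, counting a final partial batch."""
        return -(-num_training_sentences // self.batch_size)  # ceiling division


CONFIGS = {
    "11-label": FineTuneConfig(num_labels=11, batch_size=16, learning_rate=2e-5,
                               epochs=4, hidden_units=256),
    "7-label": FineTuneConfig(num_labels=7, batch_size=32, learning_rate=2e-5,
                              epochs=4, hidden_units=128),
}

# Illustrative only: 1000 is a made-up training-set size, not a figure from the paper.
example_steps = CONFIGS["11-label"].steps_per_epoch(1000)
```

Keeping both task configurations in one mapping makes it easy to see that the two tasks differ only in batch size and classifier width, while the learning rate and number of epochs are shared.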
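| Task | Model | Overall accuracy (%) | Mean recall (%) | Mean F1 (%) |
|---|---|---|---|---|
| 11-label classification | LibSVM | 51.6 | 43.0 | 46.3 |
| 11-label classification | SciBERT | 75.2 | 68.5 | 74.6 |
| 11-label classification | Improved input | 81.3 | 72.4 | 75.5 |
| 7-label classification | SciBERT | 80.1 | 76.4 | 78.8 |
| 7-label classification | Improved input | 85.5 | 80.7 | 83.1 |

Table 4  Comparison of move classification results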
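The "29.7%" improvement quoted in the abstract corresponds to the difference in accuracy between the improved-input model and the LibSVM baseline in Table 4, so it is a gain in percentage points rather than a relative increase. The short calculation below makes the distinction explicit. The values are copied from Table 4, while the `accuracy_gain` helper is a hypothetical name introduced for illustration.

```python
# (task, model) -> (overall accuracy, mean recall, mean F1), all in percent, from Table 4.
results = {
    ("11-label", "LibSVM"): (51.6, 43.0, 46.3),
    ("11-label", "SciBERT"): (75.2, 68.5, 74.6),
    ("11-label", "Improved input"): (81.3, 72.4, 75.5),
    ("7-label", "SciBERT"): (80.1, 76.4, 78.8),
    ("7-label", "Improved input"): (85.5, 80.7, 83.1),
}


def accuracy_gain(task, baseline, model="Improved input"):
    """Return (gain in percentage points, relative gain in percent) over a baseline."""
    base = results[(task, baseline)][0]
    new = results[(task, model)][0]
    points = round(new - base, 1)
    relative = round(100.0 * (new - base) / base, 1)
    return points, relative


for task, baseline in [("11-label", "LibSVM"), ("11-label", "SciBERT"), ("7-label", "SciBERT")]:
    points, relative = accuracy_gain(task, baseline)
    print(f"{task} vs {baseline}: +{points} points ({relative}% relative)")
```

Read this way, the improved input adds 6.1 points over SciBERT on the 11-label task and 5.4 points on the 7-label task, on top of the much larger gap between either BERT-based model and LibSVM.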
Fig. 7  Comparison of evaluation metrics for the 11-label move classification results of the proposed model (%)