
A Deep Learning-based Method of Argumentative Zoning for Research Articles

Wang Mo, Cui Yunpeng, Chen Li, Li Huan

Agricultural Information Institute of Chinese Academy of Agricultural Sciences, Beijing 100081, China

Key Laboratory of Big Agri-data, Ministry of Agriculture and Rural Areas, Beijing 100081, China

Funding: This paper is one of the research outputs of the Agricultural Science and Technology Innovation Program project of the Chinese Academy of Agricultural Sciences, "Correlation Discovery and Computational Mining of Multi-source Heterogeneous Agricultural Big Data" (CAAS-ASTIP-2016-AII).


Abstract

[Objective] This study develops a new argumentative zoning method based on a deep learning language representation model to achieve better classification performance. [Methods] We adopted the pre-trained deep learning language representation model BERT, augmented its input with a sentence-position feature, and conducted transfer learning on training data from biochemistry journals. The learned sentence representations were then fed into a neural network classifier to perform argumentative zoning. [Results] Experiments on a public dataset showed that, for the eleven-class task, the method achieved significant improvement for most classes: accuracy reached 81.3%, an improvement of 29.7 percentage points over the best performance from previous studies. For the seven core classes, the model achieved an accuracy of 85.5%. [Limitations] Due to limitations of the experiment environment, the refined input model was initialized with pre-trained parameters from the original model structure; how well the transferred parameters suit the new model input remains to be explored. [Conclusions] The proposed method substantially outperforms the traditional "feature engineering + machine learning" classifiers and also improves on the original BERT model, while avoiding the tedious work of feature engineering. The method is language-independent and hence also applicable to argumentative zoning of Chinese research articles, showing considerable potential for practical application.

Keywords: Argumentative Zoning; Deep Learning; Bidirectional Encoder; Neural Networks

Wang Mo. A Deep Learning-based Method of Argumentative Zoning for Research Articles[J]. Data Analysis and Knowledge Discovery, 2020, 4(6): 60-68. doi:10.11925/infotech.2096-3467.2019.0487

2 Related Research


Fig. 1   An Example of Move Structure of Research Articles

(1) The first type uses Bag-of-Words models[18,19]. Bag-of-words vector features or contextual syntactic features of a sentence serve as input to a classification algorithm, such as a Bayesian classifier. This approach enumerates all term features exhaustively but performs no feature selection, and therefore suffers from feature sparsity[3].

(2) The second type constructs features from syntactic or linguistic rules[1,3,14,20], such as sentence length, the positions of feature words, and the position of a sentence within the document. These methods still suffer from tedious feature construction and feature uncertainty.

(3) The third type is based on word embeddings[21]. Word embeddings trained with deep learning language models, such as Word2Vec[22] and GloVe[23], encode contextual and semantic-similarity information. Because embedding training operates at the token level, it represents individual words well. Although one study[21] has applied Word2Vec to argumentative zoning, it only improved the word-level representation and can be regarded as a refinement of the bag-of-words model; for a sentence-level task such as argumentative zoning, the feature-construction step still cannot be bypassed. Moreover, word embeddings trained on non-academic corpora carry considerable semantic-space uncertainty when applied directly to argumentative zoning of research articles.

3 Deep Learning Method for Argumentative Zoning


Fig. 2   Deep Learning Classification Model Structure for Argumentative Zoning

The BERT model builds bidirectional encoder representations from a stack of Transformers. It is trained on a large corpus by randomly masking parts of the text and learning to predict the masked content until the loss function is minimized. By jointly conditioning on context across all layers, BERT pre-trains deep bidirectional representations and learns contextual information effectively. Concretely, 15% of the words in the corpus text are randomly selected, and the model is trained to predict the content at those masked positions. Thanks to its bidirectional mechanism, when processing a word BERT learns from the words both before and after it, thereby capturing the surrounding context.
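The masking idea can be sketched as follows. This is a simplified illustration only: real BERT pre-training operates on WordPiece tokens and, among the selected positions, also substitutes random tokens or leaves some unchanged rather than always inserting `[MASK]`.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=None):
    """Randomly mask ~15% of tokens; return the masked sequence and a
    map of masked position -> original token to be predicted."""
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            targets[i] = tokens[i]
            masked[i] = mask_token
    return masked, targets

tokens = "the model learns bidirectional context from unlabeled text".split()
masked, targets = mask_tokens(tokens, seed=42)
```

During pre-training, the model's loss is computed only over the positions recorded in `targets`: it must reconstruct each original token from the full left and right context of the masked sequence.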

3.1 Improved Model Input

The input to the BERT model is the sum of three embeddings: token embeddings, segment embeddings, and position embeddings. The model builds lookup tables for these three embeddings and learns them as parameters during training. As shown in Fig. 3, analogous to the token-position embedding, this paper introduces a non-learned input vector, the sentence-position embedding, which encodes the position of the current input sentence within the article. The input of our model is thus the sum of these four embeddings.


Fig. 3   Input of Proposed Argumentative Zoning Model

$SE(pos,i)=\sin\left(1000\,pos \,/\, 10000^{i/d_{model}}\right)$, for even $i$
$SE(pos,i)=\cos\left(1000\,pos \,/\, 10000^{(i-1)/d_{model}}\right)$, for odd $i$
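A minimal sketch of this sinusoidal sentence-position embedding. Here `pos` is taken to be the sentence's relative position in the article and the factor of 1000 follows the formulas above; the toy dimension is an illustrative assumption (BERT-base uses 768).

```python
import math

def sentence_position_embedding(pos, d_model):
    """Sinusoidal sentence-position embedding: even dimensions use sine,
    odd dimensions use cosine, analogous to Transformer position
    encodings but driven by the (scaled) sentence position."""
    se = []
    for i in range(d_model):
        exponent = i if i % 2 == 0 else i - 1
        angle = 1000 * pos / (10000 ** (exponent / d_model))
        se.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return se

# e.g. the middle sentence of an article (relative position 0.5)
se = sentence_position_embedding(0.5, 8)
```

Like the original Transformer position encoding, this vector is computed rather than learned, so it adds no parameters; it is simply summed with the other three input embeddings.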


Fig. 4   An Example of Sentence Position Embeddings


Fig. 5   Multilayer Perceptron Classifier for Argumentative Zoning

3.3 Pre-trained Model

Most parameters of the pre-trained SciBERT model are kept fixed. The model shown in Fig. 2 is trained on the experimental data, and only the parameters of the classifier and of the last two layers of the BERT network are optimized to minimize the loss function.
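This partial fine-tuning can be sketched as a predicate over parameter names. The names follow the Hugging Face BERT naming convention and the 12-layer depth of BERT-base; both are assumptions made here for illustration, not details stated in the paper.

```python
def trainable(param_name, num_layers=12, unfreeze_last=2):
    """Return True if a named parameter should be updated during
    fine-tuning: only the classifier head and the last `unfreeze_last`
    encoder layers are optimized; all other pre-trained parameters
    stay frozen at their pre-trained values."""
    if param_name.startswith("classifier."):
        return True
    return any(
        param_name.startswith(f"encoder.layer.{layer}.")
        for layer in range(num_layers - unfreeze_last, num_layers)
    )
```

In a PyTorch-style framework this predicate would typically be applied by setting `param.requires_grad = trainable(name)` over the model's named parameters before training, so the optimizer only updates the unfrozen subset.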

4 Experiment

4.1 Data Source

Table 1  Move Structure Classes of ART Corpus Dataset

Class         Abbreviation
Conclusion    CON
Result        RES
Goal          GOA
Method        MET
Object        OBJ
Experiment    EXP
Observation   OBS
Hypothesis    HYP
Motivation    MOT
Background    BAC
Model         MOD


Fig. 6   Example of Preprocessed Data

Table 2  Statistics of the Dataset for Each Move Structure Class

5 Results and Analysis

Table 3  Hyper-parameters of Optimum Models

Task                      Batch Size    Learning Rate    Epochs    Max Sequence Length
11-class classification   16            2e-5             4         256
7-class classification    32            2e-5             4         128

Table 4  Classification Results of Different Argumentative Zoning Models

Task       Model      Precision(%)    Recall(%)    F1(%)
11-class   LibSVM     51.6            43.0         46.3
11-class   SciBERT    75.2            68.5         74.6
7-class    SciBERT    80.1            76.4         78.8


Fig. 7   Classification Metrics on 11-class Argumentative Zoning for Each Class(%)

Supporting data:

[1] Liakata Maria, Soldatova Larisa. ART_Corpus.tar.gz. The ART Corpus.

[2] Wang Mo, Cui Yunpeng, Chen Li, Li Huan. preprocessed_sentence_core_concept_class.tsv. Preprocessed dataset.

References

Liakata M, Saha S, Dobnik S, et al.

Automatic Recognition of Conceptualization Zones in Scientific Articles and Two Life Science Applications

[J]. Bioinformatics, 2012,28(7):991-1000.


Teufel S, Moens M.

Summarizing Scientific Articles: Experiments with Relevance and Rhetorical Status

[J]. Computational Linguistics, 2002,28(4):409-445.

Wang Lifei, Liu Xia.

Constructing a Model for the Automatic Identification of Move Structure in English Research Article Abstracts

[J]. Technology Enhanced Foreign Language Education, 2017(2):45-50,64.

Guo Y, Korhonen A, Poibeau T.

A Weakly-Supervised Approach to Argumentative Zoning of Scientific Documents

[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2011: 273-283.

Meng Yu, Wu Xingquan.

Writing Paradigm of Plasma Physics SCI Journal Articles from the Perspective of Move Analysis Theory

[J]. Journal of University of Shanghai for Science and Technology (Social Science), 2018,40(3):201-206.

Teufel S, Carletta J, Moens M.

An Annotation Scheme for Discourse-Level Argumentation in Research Articles

[C]//Proceedings of the 9th Conference on European Chapter of the Association for Computational Linguistics. 1999: 110-117.

Xu Fang.

A Survey on English Academic Paper Genre Studies

[J]. Journal of Southeast University (Philosophy and Social Science), 2013,15(5):128-133.

Nasar Z, Jaffry S W, Malik M K.

Information Extraction from Scientific Articles: A Survey

[J]. Scientometrics, 2018,117(3):1931-1990.

Gupta S, Manning C D.

Analyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers

[C]//Proceedings of the 5th International Joint Conference on Natural Language Processing. 2011: 1-9.

Houngbo H, Mercer R E.

Method Mention Extraction from Scientific Research Papers

[C]//Proceedings of COLING 2012. 2012:1211-1222.

Ruch P, Boyer C, Chichester C, et al.

Using Argumentation to Extract Key Sentences from Biomedical Abstracts

[J]. International Journal of Medical Informatics, 2007,76(3):195-200.

Lakhanpal S, Gupta A, Agrawal R.

Towards Extracting Domains from Research Publications

[C]// Proceedings of MAICS 2015. 2015:117-120.

Lin J, Karakos D, Demner-Fushman D, et al.

Generative Content Models for Structural Analysis of Medical Abstracts

[C]//Proceedings of the HLT-NAACL BioNLP Workshop on Linking Natural Language and Biology. 2006: 65-72.

Wu J C, Chang Y C, Liou H C, et al.

Computational Analysis of Move Structures in Academic Abstracts

[C]//Proceedings of the COLING/ACL on Interactive Presentation Sessions. 2006: 41-44.

Hirohata K, Okazaki N, Ananiadou S, et al.

Identifying Sections in Scientific Abstracts Using Conditional Random Fields

[C]//Proceedings of the 3rd International Joint Conference on Natural Language Processing: Volume-I. 2008: 381-388.

Lin S, Ng J P, Pradhan S, et al.

Extracting Formulaic and Free Text Clinical Research Articles Metadata Using Conditional Random Fields

[C]//Proceedings of the NAACL HLT 2010 2nd Louhi Workshop on Text and Data Mining of Health Documents. 2010: 90-95.

Ronzano F, Saggion H.

Dr. Inventor Framework: Extracting Structured Information from Scientific Publications

[C]//Proceedings of the International Conference on Discovery Science. 2015: 209-220.

Anthony L, Lashkia G V.

Mover: A Machine Learning Tool to Assist in the Reading and Writing of Technical Papers

[J]. IEEE Transactions on Professional Communication, 2003,46(3):185-193.

Guo Y, Korhonen A, Liakata M, et al.

Identifying the Information Structure of Scientific Abstracts: An Investigation of Three Different Schemes

[C]//Proceedings of the 2010 Workshop on Biomedical Natural Language Processing. 2010: 99-107.

Dayrell C, Candido Jr A, Lima G, et al.

Rhetorical Move Detection in English Abstracts: Multi-label Sentence Classifiers and Their Annotated Corpora

[C]//Proceedings of the 8th International Conference on Language Resources and Evaluation. 2012: 1604-1609.

Liu H.

Automatic Argumentative-Zoning Using Word2vec

[OL]. arXiv Preprint, arXiv: 1703. 10152.

Mikolov T, Sutskever I, Chen K, et al.

Distributed Representations of Words and Phrases and Their Compositionality

[C]//Proceedings of the 26th International Conference on Neural Information Processing Systems. 2013: 3111-3119.

Pennington J, Socher R, Manning C.

Glove: Global Vectors for Word Representation

[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014: 1532-1543.

Devlin J, Chang M-W, Lee K, et al.

Bert: Pre-training of Deep Bidirectional Transformers for Language Understanding

[OL]. arXiv Preprint, arXiv: 1810. 04805.

Vaswani A, Shazeer N, Parmar N, et al.

Attention is All You Need

[OL]. arXiv Preprint, arXiv: 1706. 03762.

Beltagy I, Lo K, Cohan A.

SciBERT: A Pretrained Language Model for Scientific Text

[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019: 3606-3611.
