Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (2/3): 298-307    DOI: 10.11925/infotech.2096-3467.2021.0973
Identifying Moves from Scientific Abstracts Based on Paragraph-BERT-CRF
Guo Hangcheng, He Yanqing, Lan Tian, Wu Zhenfeng, Dong Cheng
Institute of Scientific and Technical Information of China, Beijing 100038, China
Abstract  

[Objective] This paper aims to automatically identify the rhetorical moves in scientific paper abstracts, namely the purpose, method, results, and conclusion of a paper, so that readers can quickly grasp the main content of the literature and conduct semantic retrieval. [Methods] We propose a neural network model for abstract move recognition based on the Paragraph-BERT-CRF framework, which makes full use of paragraph-level context. The model also incorporates an attention mechanism and the transition relationships between move labels in a sequence. [Results] We evaluated the model on 94,456 abstracts of scientific papers. The weighted average precision, recall, and F1 were 97.45%, 97.44%, and 97.44%, respectively. In comparison and ablation experiments against CRF, BiLSTM, BiLSTM-CRF, BERT, BERT-CRF, and Paragraph-BERT, the proposed model achieved the highest precision, recall, and F1 on both the validation and test sets. [Limitations] We only used the basic BERT pre-trained language model. Further work is needed to tune the model parameters and to evaluate additional pre-trained language models for move recognition. [Conclusions] The attention mechanism and paragraph-level contextual information effectively improve the model's scores and help identify move information in abstracts.
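One way to picture how transition relationships between move labels shape a prediction is Viterbi decoding over per-unit label scores. The sketch below is illustrative only: it labels whole sentences for brevity, and the scores, the backward-transition penalty, and the function name `viterbi` are made-up assumptions, not values or code from the paper.

```python
from typing import Dict, List, Tuple

LABELS = ["purpose", "method", "result", "conclusion"]


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float]) -> List[str]:
    """Return the highest-scoring label sequence.

    emissions[t][y] is the score of label y for unit t;
    transitions[(a, b)] is the score of moving from label a to label b
    (missing pairs default to 0.0).
    """
    # best[t][y] = best score of any path that ends in label y at step t
    best = [{y: emissions[0][y] for y in LABELS}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in LABELS:
            prev = max(LABELS,
                       key=lambda p: best[-1][p] + transitions.get((p, y), 0.0))
            scores[y] = (best[-1][prev] + transitions.get((prev, y), 0.0)
                         + emissions[t][y])
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label
    path = [max(LABELS, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores for four sentences; the third is ambiguous.
emissions = [
    {"purpose": 2.0, "method": 0.5, "result": 0.0, "conclusion": 0.0},
    {"purpose": 0.1, "method": 2.0, "result": 0.3, "conclusion": 0.0},
    {"purpose": 1.0, "method": 0.2, "result": 0.9, "conclusion": 0.1},
    {"purpose": 0.0, "method": 0.0, "result": 0.0, "conclusion": 2.0},
]
# Assumed penalty for moving "backwards" in the usual move order.
transitions = {(a, b): -2.0
               for i, a in enumerate(LABELS)
               for j, b in enumerate(LABELS) if j < i}

greedy = [max(LABELS, key=e.get) for e in emissions]
decoded = viterbi(emissions, transitions)
print(greedy)
print(decoded)
```

In this toy input, picking each label independently sends the third sentence back to "purpose", while decoding with the transition penalty keeps the sequence in order. This is the kind of effect a CRF layer's learned transition scores can have.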

Key words: Rhetorical Moves; Self-Attention Mechanism; Paragraph Context; BERT
Received: 01 September 2021      Published: 28 February 2022
ZTFLH:  TP391  
Fund: Key Project of Institute of Scientific and Technical Information of China (ZD2021-17); Innovation Research Fund General Project of Institute of Scientific and Technical Information of China (MS2021-03); Innovation Research Fund Youth Project of Institute of Scientific and Technical Information of China (QN2021-12)
Corresponding Author: He Yanqing, ORCID: 0000-0002-8791-1581, E-mail: heyq@istic.ac.cn

Cite this article:

Guo Hangcheng, He Yanqing, Lan Tian, Wu Zhenfeng, Dong Cheng. Identifying Moves from Scientific Abstracts Based on Paragraph-BERT-CRF. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 298-307.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0973     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I2/3/298

The Overall Framework
Input Representation of Moves Structure Recognition Model
Paragraph-BERT-CRF Module
Example of Raw Text of Structured Abstracts
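Label          Meaning
B-purpose      Beginning of a purpose move
I-purpose      Inside a purpose move
B-method       Beginning of a method move
I-method       Inside a method move
B-result       Beginning of a result move
I-result       Inside a result move
B-conclusion   Beginning of a conclusion move
I-conclusion   Inside a conclusion move
O              Outside any move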
Label Systems
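To make the label system concrete, the following sketch converts move-annotated text into B-/I- labels and back. It is a minimal illustration under stated assumptions: whitespace tokenization, the sample sentences, and the helper names `bio_encode` and `bio_decode` are hypothetical and not taken from the paper, whose tokenization is not described on this page.

```python
from typing import List, Tuple


def bio_encode(segments: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Turn (move, text) segments into (token, label) pairs."""
    pairs = []
    for move, text in segments:
        for i, token in enumerate(text.split()):
            prefix = "B" if i == 0 else "I"  # first token opens the move
            pairs.append((token, f"{prefix}-{move}"))
    return pairs


def bio_decode(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group (token, label) pairs back into (move, text) segments."""
    segments = []
    for token, label in pairs:
        if label == "O":  # token belongs to no move
            continue
        prefix, move = label.split("-", 1)
        if prefix == "B" or not segments or segments[-1][0] != move:
            segments.append([move, [token]])  # start a new segment
        else:
            segments[-1][1].append(token)     # continue the current one
    return [(move, " ".join(tokens)) for move, tokens in segments]


# Made-up sample text, for illustration only.
sample = [
    ("purpose", "We study move recognition ."),
    ("method", "We fine-tune BERT with a CRF layer ."),
]
encoded = bio_encode(sample)
for token, label in encoded[:3]:
    print(token, label)
```

Decoding the encoded pairs with `bio_decode(encoded)` recovers the original segments, which is a quick check that the labels carry the move boundaries.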
Parameter                  Value      Parameter        Value
algorithm                  'lbfgs'    c1               0.1
all_possible_states        False      c2               0.05
all_possible_transitions   True       max_iterations   50
averaging                  True       verbose          True
Hyperparameter Settings of CRF
Parameter          Value   Parameter           Value
word_emb_dim       300     learning_rate       0.0003
char_emb_dim       300     lr_decay            0.05
word_seq_feature   LSTM    l2 regularization   1e-8
bilstm             True    optimizer           SGD
use_crf            True    iteration           30
ave_batch_loss     True    batch_size          50
Hyperparameter Settings of BiLSTM-CRF
Parameter             Value   Parameter    Value
require_improvement   1000    batch_size   8
num_epochs            5       pad_size     512
learning_rate         5e-5    use_crf      True
Hyperparameter Settings of BERT-CRF
Parameter                  Value   Parameter              Value
train_max_seq_length       512     learning_rate          3e-5
eval_max_seq_length        512     crf_learning_rate      1e-3
per_gpu_train_batch_size   16      num_train_epochs       3
per_gpu_eval_batch_size    16      overwrite_output_dir   True
Hyperparameter Settings of Paragraph-BERT-CRF
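The table lists two learning rates, `learning_rate` and `crf_learning_rate`. A common reading is that the CRF layer's parameters are updated with the larger rate and everything else with the smaller one. The sketch below shows that split in a framework-agnostic way. The parameter names and the `crf.` prefix convention are assumptions made for illustration, not details confirmed by the source.

```python
from typing import Dict, List

LEARNING_RATE = 3e-5      # from the table: learning_rate
CRF_LEARNING_RATE = 1e-3  # from the table: crf_learning_rate


def learning_rate_for(param_name: str) -> float:
    """Pick a learning rate from a (hypothetical) parameter name."""
    return CRF_LEARNING_RATE if param_name.startswith("crf.") else LEARNING_RATE


def group_by_learning_rate(param_names: List[str]) -> Dict[float, List[str]]:
    """Bucket parameter names by the learning rate they would receive."""
    groups: Dict[float, List[str]] = {}
    for name in param_names:
        groups.setdefault(learning_rate_for(name), []).append(name)
    return groups


# Hypothetical parameter names, for illustration only.
names = [
    "bert.encoder.layer.0.attention.weight",
    "classifier.weight",
    "crf.transitions",
]
groups = group_by_learning_rate(names)
for lr, members in groups.items():
    print(lr, members)
```

In a deep learning framework, each bucket would typically become one optimizer parameter group with its own learning rate.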
Move               Precision (P)   Recall (R)   F1       Support
Purpose            0.9829          0.9821       0.9825   9446
Method             0.9911          0.9731       0.9820   9445
Result             0.9639          0.9701       0.9670   9430
Conclusion         0.9601          0.9721       0.9661   9428
Weighted average   0.9745          0.9744       0.9744   37749
Experimental Results of Paragraph-BERT-CRF
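The weighted-average row can be checked from the per-move rows: each score is weighted by that move's support and the sum is divided by the total support. The short script below does this with the values from the table above. The variable and function names are illustrative.

```python
# (precision, recall, F1, support) per move, copied from the table
rows = {
    "purpose":    (0.9829, 0.9821, 0.9825, 9446),
    "method":     (0.9911, 0.9731, 0.9820, 9445),
    "result":     (0.9639, 0.9701, 0.9670, 9430),
    "conclusion": (0.9601, 0.9721, 0.9661, 9428),
}


def weighted_average(column: int) -> float:
    """Support-weighted mean of one score column (0=P, 1=R, 2=F1)."""
    total_support = sum(r[3] for r in rows.values())
    return sum(r[column] * r[3] for r in rows.values()) / total_support


total_support = sum(r[3] for r in rows.values())
precision = weighted_average(0)
recall = weighted_average(1)
f1 = weighted_average(2)
print(total_support, f"{precision:.4f} {recall:.4f} {f1:.4f}")
```

Rounded to four decimals, the results agree with the reported weighted averages and the total support of 37749.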
                     Validation set               Test set
Model                P        R        F1         P        R        F1
CRF                  0.6860   0.7171   0.7010     0.6830   0.7090   0.6951
BiLSTM               0.7320   0.7631   0.7470     0.7110   0.6880   0.6991
BiLSTM-CRF           0.7470   0.7741   0.7600     0.7240   0.7010   0.7121
BERT                 0.9605   0.9604   0.9601     0.9598   0.9603   0.9601
BERT-CRF             0.9676   0.9674   0.9674     0.9656   0.9662   0.9662
Paragraph-BERT       0.9731   0.9725   0.9725     0.9714   0.9719   0.9719
Paragraph-BERT-CRF   0.9752   0.9731   0.9731     0.9745   0.9744   0.9744
Experimental Results of Different Models
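To read the comparison as an ablation, it helps to express each model's test-set F1 as a gap, in percentage points, to the full Paragraph-BERT-CRF model. The snippet below computes those gaps from the table values. The variable names are illustrative.

```python
# Test-set F1 per model, copied from the table
test_f1 = {
    "CRF": 0.6951,
    "BiLSTM": 0.6991,
    "BiLSTM-CRF": 0.7121,
    "BERT": 0.9601,
    "BERT-CRF": 0.9662,
    "Paragraph-BERT": 0.9719,
    "Paragraph-BERT-CRF": 0.9744,
}

full_model = "Paragraph-BERT-CRF"
# Gap to the full model, in F1 percentage points
gains = {
    name: round((test_f1[full_model] - f1) * 100, 2)
    for name, f1 in test_f1.items()
    if name != full_model
}
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
```

On these figures, removing the CRF layer (Paragraph-BERT) costs about 0.25 points and removing paragraph context (BERT-CRF) costs about 0.82 points, while the non-BERT baselines trail by more than 26 points.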