Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (8): 105-118    DOI: 10.11925/infotech.2096-3467.2022.0476
Structural Recognition of Abstracts of Academic Text Enhanced by Domain Bilingual Data
Liu Jiangfeng1, Feng Yutong1, Liu Liu1, Shen Si2, Wang Dongbo1
1College of Information Management, Nanjing Agricultural University, Nanjing 210095, China
2School of Economics & Management, Nanjing University of Science and Technology, Nanjing 210094, China
Abstract  

[Objective] This paper aims to accurately capture the core content of social science academic literature and to improve the recognition of the structure of literature abstracts. [Methods] Experiments were conducted on bilingual abstract data from several core journals in library and information science using pre-trained language models, and a domain-data enhancement method was proposed for three stages: pre-training, fine-tuning, and the model's output layer. [Results] Enhanced pre-training, enhanced fine-tuning, and fusion of bilingual sentence classification probabilities improved the F1 score of abstract structure recognition on single-journal data by 1-2, 1, and 0.5-1 percentage points, respectively. [Limitations] Due to limited computing resources, continued pre-training on domain bilingual text and the corresponding performance tests were not carried out for the cross-language pre-trained models. [Conclusions] This research makes full use of the bilingual resources in academic literature and effectively improves abstract structure recognition, helping readers quickly grasp the content of the literature and promoting scholarly communication.

Keywords: Cross-Language; Data Enhancement; Pre-Trained Model; Move Recognition; Probability Integration
Received: 12 May 2022      Published: 08 October 2023
CLC Number: G353
Fund: National Natural Science Foundation of China (71974094)
Corresponding Author: Wang Dongbo, ORCID: 0000-0002-9894-9550, E-mail: db.wang@njau.edu.cn.

Cite this article:

Liu Jiangfeng, Feng Yutong, Liu Liu, Shen Si, Wang Dongbo. Structural Recognition of Abstracts of Academic Text Enhanced by Domain Bilingual Data. Data Analysis and Knowledge Discovery, 2023, 7(8): 105-118.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.0476     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2023/V7/I8/105

Research Framework
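As a minimal illustration of the sentence-level move classifier at the base of this framework, the sketch below loads a Chinese BERT checkpoint with a five-class head, assuming the Hugging Face transformers library. The five move labels follow the unified label specification later in this paper; the example sentence and hyperparameters are illustrative, not the paper's exact setup.

```python
# A minimal sketch of the sentence-level move classifier, assuming the
# Hugging Face transformers library. The checkpoint matches the
# BERT-Base-Chinese baseline; other details are illustrative.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

labels = ["Objective", "Method", "Result", "Limitation", "Conclusion"]
tok = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=len(labels))  # head must be fine-tuned first

sentence = "本文旨在提升学术文本摘要结构的识别效果。"  # one abstract sentence
inputs = tok(sentence, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)  # one row of p_{i,j}
move_tag = labels[int(probs.argmax())]  # predicted move (random until fine-tuned)
```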
| Symbol | Definition |
| --- | --- |
| tag = {index: categories} | A dictionary in which index is the id of an abstract sentence type and categories is the corresponding type name |
| move_tag | The abstract sentence type predicted by the model |
| m = len(tag) | The total number of abstract sentence types, m |
| cn, en | A Chinese sentence and its corresponding English sentence, respectively |
| $p_{i,j}^{cn}$, $p_{i,j}^{en}$ | The probability that the i-th Chinese (or English) sentence belongs to the j-th abstract sentence type |
| concat | Fusing sentence-class probabilities by concatenation |
| weighted | Fusing sentence-class probabilities by weighting |
| α, β | The weights assigned to the Chinese and English probability matrices in weighted fusion |

Definition of Formula Symbols
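To make the concat and weighted operations concrete, here is a minimal sketch under the symbol definitions above, assuming each model outputs an (n, m) probability matrix as a numpy array; the toy inputs are invented for illustration, and the classifier on top of the concatenated features is left out.

```python
# A sketch of the two fusion strategies from the symbol table, assuming the
# Chinese and English models each output an (n, m) sentence-class probability
# matrix. Variable names follow the symbol table.
import numpy as np

def weighted_fusion(p_cn: np.ndarray, p_en: np.ndarray,
                    alpha: float, beta: float) -> np.ndarray:
    """Weighted fusion: element-wise combination of the two matrices."""
    assert p_cn.shape == p_en.shape
    return alpha * p_cn + beta * p_en

def concat_fusion(p_cn: np.ndarray, p_en: np.ndarray) -> np.ndarray:
    """Concatenation fusion: each m-dim row pair becomes a 2m-dim feature,
    to be fed to a downstream classifier (omitted here)."""
    return np.concatenate([p_cn, p_en], axis=-1)

# Toy example: probabilities for 2 sentence pairs over m = 5 move types.
p_cn = np.array([[0.70, 0.10, 0.10, 0.05, 0.05],
                 [0.10, 0.60, 0.20, 0.05, 0.05]])
p_en = np.array([[0.60, 0.20, 0.10, 0.05, 0.05],
                 [0.20, 0.50, 0.20, 0.05, 0.05]])

fused = weighted_fusion(p_cn, p_en, alpha=0.5, beta=0.5)
move_tag = fused.argmax(axis=1)  # predicted move id per sentence pair
print(move_tag)  # -> [0 1]
```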
| Journal | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 《情报科学》 | - | - | - | 358 | 352 | 329 | 304 | 298 | 72 | 1,713 |
| 《情报理论与实践》 | - | - | 82 | 179 | 249 | 273 | 291 | 288 | 99 | 1,461 |
| 《情报杂志》 | - | - | 418 | 396 | 365 | 352 | 351 | 336 | 83 | 2,301 |
| 《情报资料工作》 | - | - | - | - | - | - | 76 | 71 | 21 | 168 |
| 《图书情报工作》 | - | 485 | 456 | 427 | 405 | 389 | 389 | 365 | 18 | 2,934 |
| 《图书情报知识》 | - | - | - | - | - | 71 | 70 | 78 | 11 | 230 |
| 《现代情报》 | - | - | - | - | 145 | 223 | 217 | 211 | 64 | 860 |
| 《数据分析与知识发现》 | 152 | 155 | 139 | 192 | 190 | 123 | 143 | 61 | - | 1,155 |
| Total | 152 | 640 | 1,095 | 1,552 | 1,706 | 1,760 | 1,841 | 1,708 | 368 | 10,822 |
Distribution of the Number of Articles in Each Journal
| Normalized Label | Original Labels (Chinese) | Original Labels (English) |
| --- | --- | --- |
| 目的 (Objective) | 目的、目标、应用背景 | Objective, Context |
| 方法 (Method) | 方法、文献范围 | Method(s), Coverage |
| 结果 (Result) | 结果、讨论 | Result(s), Discussion |
| 局限 (Limitation) | 局限 | Limitation(s) |
| 结论 (Conclusion) | 结论、创新/价值 | Conclusion(s), Values |
Unified Specification of Move Labels
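A sketch of applying this unification programmatically, assuming raw move labels arrive as plain strings extracted from structured abstracts. The mapping mirrors the table above; the singular/plural spelling variants included are assumptions added for robustness, and unseen labels fall through unchanged.

```python
# Normalize raw move labels (Chinese or English) to the five unified classes.
# Entries mirror the unification table; extra spelling variants are assumed.
LABEL_MAP = {
    # Chinese original labels -> normalized label
    "目的": "目的", "目标": "目的", "应用背景": "目的",
    "方法": "方法", "文献范围": "方法",
    "结果": "结果", "讨论": "结果",
    "局限": "局限",
    "结论": "结论", "创新/价值": "结论",
    # English original labels -> the same five normalized classes
    "Objective": "目的", "Context": "目的",
    "Method": "方法", "Methods": "方法", "Coverage": "方法",
    "Result": "结果", "Results": "结果", "Discussion": "结果",
    "Limitation": "局限", "Limitations": "局限",
    "Conclusion": "结论", "Conclusions": "结论", "Values": "结论",
}

def normalize_label(raw: str) -> str:
    """Map a raw label to its normalized class; pass unknown labels through."""
    return LABEL_MAP.get(raw.strip(), raw)
```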
Sentence Length Distribution of Chinese and English Abstracts
Average Number of Sentences in a Single Abstract (Chinese)
| Model | Domain Data Enhancement | Objective | Method | Result | Limitation | Conclusion | Accuracy | Macro Avg | Weighted Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT-Base-Chinese (Base) | - | 0.8697 | 0.9054 | 0.4610 | 0.7159 | 0.8014 | 0.8321 | 0.7507 | 0.8257 |
| Chinese-RoBERTa-wwm-ext | - | 0.8787 | 0.9048 | 0.4760 | 0.7015 | 0.8053 | 0.8359 | 0.7533 | 0.8301 |
| BERT-Base-Chinese | cSS+ | 0.8747 | 0.9086 | 0.5127 | 0.7413 | 0.8183 | 0.8425 | 0.7711 | 0.8377 |
| Chinese-RoBERTa-wwm-ext | cSS+ | 0.8750 | 0.9049 | 0.5271 | 0.7376 | 0.8173 | 0.8418 | 0.7724 | 0.8373 |
| BERT-Base-Cased (Base) | - | 0.8444 | 0.8891 | 0.2025 | 0.6171 | 0.7659 | 0.7983 | 0.6638 | 0.7808 |
| SciBERT-Scivocab-Cased | - | 0.8604 | 0.9012 | 0.3860 | 0.6846 | 0.7926 | 0.8220 | 0.7250 | 0.8126 |
| SciBERT-Scivocab-Cased | eSS+ | 0.8614 | 0.8992 | 0.4152 | 0.6897 | 0.7918 | 0.8226 | 0.7314 | 0.8144 |
Experimental Results of Domain Data Enhanced Pre-Training
Performance of Pretrained Model
| Model | Domain Data Enhancement | Objective | Method | Result | Limitation | Conclusion | Accuracy | Macro Avg | Weighted Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT-Base-Chinese (Base) | - | 0.8315 | 0.8772 | 0.8061 | 0.9541 | 0.8368 | 0.8585 | 0.8611 | 0.8578 |
| Chinese-RoBERTa-wwm-ext | - | 0.5541 | 0.7478 | 0.7273 | 0.8305 | 0.6373 | 0.7060 | 0.6994 | 0.6982 |
| BERT-Base-Chinese | cSS+ | 0.8370 | 0.8850 | 0.8284 | 0.9577 | 0.8462 | 0.8679 | 0.8709 | 0.8680 |
| Chinese-RoBERTa-wwm-ext | cSS+ | 0.8464 | 0.9003 | 0.8327 | 0.9585 | 0.8333 | 0.8726 | 0.8743 | 0.8724 |
| BERT-Base-Cased (Base) | - | 0.7315 | 0.8323 | 0.8043 | 0.9194 | 0.7573 | 0.8082 | 0.8090 | 0.8072 |
| SciBERT-Scivocab-Cased | - | 0.7774 | 0.8401 | 0.7934 | 0.9346 | 0.7580 | 0.8176 | 0.8207 | 0.8187 |
| SciBERT-Scivocab-Cased | eSS+ | 0.7645 | 0.8447 | 0.8157 | 0.9238 | 0.7981 | 0.8286 | 0.8294 | 0.8270 |
Experimental Results of Domain Data Enhanced Pre-Training (<DAKD> Data Only)
| Dataset | C-C | E-C | Bilingual-C | C-E | E-E | Bilingual-E |
| --- | --- | --- | --- | --- | --- | --- |
| Train: Chinese | √ | | | √ | | |
| Train: English | | √ | | | √ | |
| Train: Bilingual mixture | | | √ | | | √ |
| Test: Chinese | √ | √ | √ | | | |
| Test: English | | | | √ | √ | √ |
Bilingual Data Enhancement Experimental Data Set for Cross-language Model
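The six configurations above can be assembled from sentence-aligned bilingual records; the sketch below is one way to do it, with the record layout (aligned Chinese and English text sharing one move label) assumed for illustration rather than taken from the paper's code.

```python
# A sketch of assembling the six train/test configurations, assuming
# sentence-level records with aligned Chinese and English text and a shared
# normalized move label. The Record layout is an illustrative assumption.
from dataclasses import dataclass

@dataclass
class Record:
    cn: str      # Chinese sentence
    en: str      # aligned English sentence
    label: str   # normalized move label

def build_split(records: list[Record], lang: str) -> list[tuple[str, str]]:
    """Return (text, label) pairs for one language setting."""
    if lang == "cn":
        return [(r.cn, r.label) for r in records]
    if lang == "en":
        return [(r.en, r.label) for r in records]
    if lang == "bilingual":  # mix Chinese and English sentences in one set
        return [(r.cn, r.label) for r in records] + \
               [(r.en, r.label) for r in records]
    raise ValueError(f"unknown language setting: {lang}")

# e.g. the Bilingual-C configuration: bilingual training set, Chinese test set
# train = build_split(train_records, "bilingual")
# test  = build_split(test_records, "cn")
```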
| Pre-trained Model | Metric | C-C (Base) | E-C | Bilingual-C | C-E | E-E | Bilingual-E |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BERT-Base-Multilingual-Cased (Base) | Accuracy | 0.8109 | 0.7714 | 0.8180 | 0.7431 | 0.7957 | 0.8096 |
| | Macro Avg | 0.6945 | 0.6344 | 0.7180 | 0.6222 | 0.6866 | 0.7094 |
| | Weighted Avg | 0.7960 | 0.7501 | 0.8074 | 0.7291 | 0.7824 | 0.8006 |
| XLM-RoBERTa-Base | Accuracy | 0.8160 | 0.7778 | 0.8262 | 0.7800 | 0.7990 | 0.8052 |
| | Macro Avg | 0.6872 | 0.6510 | 0.7047 | 0.6480 | 0.6713 | 0.6808 |
| | Weighted Avg | 0.7791 | 0.7571 | 0.8096 | 0.7622 | 0.7811 | 0.7883 |
| XLM-Align-Base | Accuracy | 0.8175 | 0.7932 | 0.8242 | 0.7824 | 0.7985 | 0.8058 |
| | Macro Avg | 0.6819 | 0.6641 | 0.6995 | 0.6504 | 0.6660 | 0.6852 |
| | Weighted Avg | 0.7962 | 0.7705 | 0.8067 | 0.7645 | 0.7795 | 0.7902 |
Experimental Results of Cross-Language Model with Bilingual Data Enhancement (F1)
| Pre-trained Model | Metric | C-C (Base) | E-C | Bilingual-C | C-E | E-E | Bilingual-E |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BERT-Base-Multilingual-Cased (Base) | Accuracy | 0.8459 | 0.7783 | 0.8412 | 0.7138 | 0.8082 | 0.8208 |
| | Macro Avg | 0.8478 | 0.7737 | 0.8438 | 0.7199 | 0.8102 | 0.8245 |
| | Weighted Avg | 0.8446 | 0.7725 | 0.8415 | 0.7130 | 0.8076 | 0.8225 |
| XLM-RoBERTa-Base | Accuracy | 0.8459 | 0.8223 | 0.8679 | 0.7374 | 0.8239 | 0.8223 |
| | Macro Avg | 0.8471 | 0.8231 | 0.8700 | 0.7425 | 0.8267 | 0.8263 |
| | Weighted Avg | 0.8447 | 0.8212 | 0.8678 | 0.7395 | 0.8235 | 0.8233 |
| XLM-Align-Base | Accuracy | 0.7972 | 0.8129 | 0.8522 | 0.7469 | 0.7956 | 0.8097 |
| | Macro Avg | 0.7977 | 0.8125 | 0.8526 | 0.7440 | 0.7964 | 0.8118 |
| | Weighted Avg | 0.7942 | 0.7475 | 0.8516 | 0.7412 | 0.7951 | 0.8097 |
Experimental Results of Cross-Language Model with Bilingual Data Enhancement (F1) (<DAKD> Data Only)
| Model | Dataset Language Type | Accuracy | Macro Avg | Weighted Avg |
| --- | --- | --- | --- | --- |
| BERT-Base-Chinese (Base) | C-C | 0.8321 | 0.7507 | 0.8257 |
| Chinese-RoBERTa-wwm-ext | C-C | 0.8359 | 0.7533 | 0.8301 |
| BERT-Base-Chinese (cSS+) | C-C | 0.8425 | 0.7711 | 0.8377 |
| Chinese-RoBERTa-wwm-ext (cSS+) | C-C | 0.8418 | 0.7724 | 0.8373 |
| BERT-Base-Cased (Base) | C-C | 0.7983 | 0.6638 | 0.7808 |
| SciBERT-Scivocab-Cased | C-C | 0.8220 | 0.7250 | 0.8126 |
| SciBERT-Scivocab-Cased (eSS+) | C-C | 0.8226 | 0.7314 | 0.8144 |
| BERT-Base-Multilingual-Cased (Base) | Bilingual-C | 0.8180 | 0.7180 | 0.8074 |
| XLM-RoBERTa-Base | Bilingual-C | 0.8262 | 0.7047 | 0.8096 |
| XLM-Align-Base | Bilingual-C | 0.8242 | 0.6995 | 0.8067 |
Best Performance of Cross-Language Model Versus Benchmark Model (F1)
| Model | Dataset Language Type | Accuracy | Macro Avg | Weighted Avg |
| --- | --- | --- | --- | --- |
| BERT-Base-Chinese (Base) | C-C | 0.8585 | 0.8611 | 0.8578 |
| Chinese-RoBERTa-wwm-ext | C-C | 0.7060 | 0.6994 | 0.6982 |
| BERT-Base-Chinese (cSS+) | C-C | 0.8679 | 0.8709 | 0.8680 |
| Chinese-RoBERTa-wwm-ext (cSS+) | C-C | 0.8726 | 0.8743 | 0.8724 |
| BERT-Base-Cased (Base) | C-C | 0.8082 | 0.8090 | 0.8072 |
| SciBERT-Scivocab-Cased | C-C | 0.8176 | 0.8207 | 0.8187 |
| SciBERT-Scivocab-Cased (eSS+) | C-C | 0.8286 | 0.8294 | 0.8270 |
| BERT-Base-Multilingual-Cased (Base) | C-C | 0.8459 | 0.8478 | 0.8446 |
| XLM-RoBERTa-Base | Bilingual-C | 0.8679 | 0.8700 | 0.8678 |
| XLM-Align-Base | Bilingual-C | 0.8522 | 0.8526 | 0.8516 |
Best Performance of Cross-Language Model Versus Benchmark Model (F1) (<DAKD> Data Only)
| No. | Chinese Model Probability Weight | English Model Probability Weight |
| --- | --- | --- |
| 1 | 3 | 7 |
| 2 | 4 | 6 |
| 3 | 5 | 5 |
| 4 | 6 | 4 |
| 5 | 7 | 3 |
Probability Weight for Weighted Method
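Scanning these five weight pairs amounts to a small grid search over the weighted fusion defined earlier. Below is a sketch, assuming the probability matrices and gold label ids are available as numpy arrays and reading each table row as an α:β ratio (so 3/7 is normalized to 0.3/0.7); the accuracy helper stands in for the paper's full metric suite.

```python
# A sketch of the weight scan over the five pairs in the table above.
import numpy as np

PAIRS = [(3, 7), (4, 6), (5, 5), (6, 4), (7, 3)]  # Chinese : English weights

def scan_weights(p_cn: np.ndarray, p_en: np.ndarray,
                 gold: np.ndarray) -> dict:
    """Evaluate weighted fusion for each (alpha, beta) pair via accuracy."""
    results = {}
    for a, b in PAIRS:
        alpha, beta = a / (a + b), b / (a + b)  # normalize so alpha + beta = 1
        fused = alpha * p_cn + beta * p_en      # weighted fusion
        acc = float((fused.argmax(axis=1) == gold).mean())
        results[(alpha, beta)] = acc
    return results
```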
Experimental Results of Bilingual Probability Fusion - Accuracy (<DAKD> Data Only)
Experimental Results of Bilingual Probability Fusion - Macro Avg (F1) (<DAKD> Data Only)
Experimental Results of Bilingual Probability Fusion - Weighted Avg (F1) (<DAKD> Data Only)
Best Performance of Prediction Results (F1) (<DAKD> Data Only)
| Move | Data-Enhanced Model | Benchmark Model | Number of Sentences |
| --- | --- | --- | --- |
| Objective | 0.9398 | 0.9098 | 133 |
| Method | 0.9930 | 0.9577 | 142 |
| Result | 0.8621 | 0.8966 | 145 |
| Limitation | 1.0000 | 0.9908 | 109 |
| Conclusion | 0.9252 | 0.9626 | 107 |
| Weighted Avg | 0.9418 | 0.9403 | 636 |
Performance of Fuzzy Recognition (F1) (<DAKD> Data Only)