Structural Recognition of Abstracts of Academic Text Enhanced by Domain Bilingual Data*

Data Analysis and Knowledge Discovery, 2023, Vol. 7, Issue (8): 105-118
https://doi.org/10.11925/infotech.2096-3467.2022.0476

Research Paper

Liu Jiangfeng (刘江峰)¹, Feng Yutong (冯钰童)¹, Liu Liu (刘浏)¹, Shen Si (沈思)², Wang Dongbo (王东波)¹

¹College of Information Management, Nanjing Agricultural University, Nanjing 210095, China
²School of Economics & Management, Nanjing University of Science and Technology, Nanjing 210094, China


Abstract

[Objective] This paper aims to capture the core content of social science academic literature accurately by improving the recognition of move structure in abstracts. [Methods] Using pre-trained language models, experiments were conducted on bilingual abstract data from several core journals in library and information science, and a method is proposed that uses domain data to enhance learning at three stages: pre-training, fine-tuning, and the model's output layer. [Results] On single-journal data, enhanced pre-training with domain bilingual data, fine-tuning, and fusion of bilingual sentence classification probabilities improved the F1 of abstract structure recognition by about 1 to 2, 1, and 0.5 to 1 percentage points, respectively. [Limitations] Owing to limited computing resources, continued pre-training on domain data was not carried out on the cross-lingual pre-trained models, so its performance was not tested. [Conclusions] The study makes full use of the bilingual resources in academic literature and effectively improves move structure recognition in abstracts, which is of some value for quickly understanding the content of literature and promoting scientific communication.

Keywords: Cross-Language, Data Enhancement, Pre-Trained Model, Move Recognition, Probability Integration

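Received: 2022-05-12 Published: 2023-10-08

Chinese Library Classification (ZTFLH): G353

Funding: * National Natural Science Foundation of China project (71974094)

Corresponding author: Wang Dongbo, ORCID: 0000-0002-9894-9550, E-mail: db.wang@njau.edu.cn.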

Cite this article:

Liu Jiangfeng, Feng Yutong, Liu Liu, Shen Si, Wang Dongbo. Structural Recognition of Abstracts of Academic Text Enhanced by Domain Bilingual Data[J]. Data Analysis and Knowledge Discovery, 2023, 7(8): 105-118.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.0476 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2023/V7/I8/105

Fig.1 Research framework

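Symbol	Definition
tag = {index: categories}	tag: a dictionary; index is the id of an abstract sentence type, and categories is the sentence type itself
move_tag	The abstract sentence type predicted by the model
m	m = len(tag), the total number of abstract sentence types
cn, en	A Chinese sentence and its corresponding English sentence, respectively
$p_{i,j}^{cn}$, $p_{i,j}^{en}$	The probability that the i-th Chinese or English sentence belongs to the j-th abstract sentence type
concat	The concatenation method for computing the fused sentence-class probability
weighted	The weighting method for computing the fused sentence-class probability
α, β	The weights assigned to the Chinese and English probability matrices, respectively, when the weighting method is used

Table 1 Definitions of formula symbols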
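
The symbols in Table 1 can be tied together with a small sketch of output-layer probability fusion. This is an illustration under stated assumptions, not the authors' implementation: the label names in `tag`, the toy probabilities, the default weights, and the function names `fuse_weighted`, `fuse_concat`, and `predict_moves` are all hypothetical. In particular, the page only names the concat method, so the reading used here (join the Chinese and English rows into 2m scores and map the arg-max position back to a class) is one possible interpretation.

```python
from typing import Dict, List

Matrix = List[List[float]]

# Hypothetical label dictionary in the shape of `tag` from Table 1.
# The real label set is defined in the paper's Table 3, not on this page.
tag: Dict[int, str] = {
    0: "objective",
    1: "methods",
    2: "results",
    3: "limitations",
    4: "conclusions",
}


def fuse_weighted(p_cn: Matrix, p_en: Matrix,
                  alpha: float = 0.5, beta: float = 0.5) -> Matrix:
    """Weighted fusion: alpha * p_cn[i][j] + beta * p_en[i][j]."""
    return [
        [alpha * c + beta * e for c, e in zip(row_cn, row_en)]
        for row_cn, row_en in zip(p_cn, p_en)
    ]


def fuse_concat(p_cn: Matrix, p_en: Matrix) -> Matrix:
    """Assumed reading of 'concat': each row becomes 2m scores (cn, then en)."""
    return [row_cn + row_en for row_cn, row_en in zip(p_cn, p_en)]


def predict_moves(scores: Matrix, tag: Dict[int, str]) -> List[str]:
    """Return move_tag per sentence: arg-max position, folded back into m classes."""
    m = len(tag)
    moves = []
    for row in scores:
        best = max(range(len(row)), key=row.__getitem__)
        moves.append(tag[best % m])
    return moves


# Toy probabilities for two aligned sentence pairs (invented numbers).
p_cn = [[0.50, 0.30, 0.10, 0.05, 0.05],
        [0.10, 0.40, 0.45, 0.03, 0.02]]
p_en = [[0.40, 0.45, 0.05, 0.05, 0.05],
        [0.05, 0.60, 0.30, 0.03, 0.02]]

chinese_only = predict_moves(p_cn, tag)
weighted = predict_moves(fuse_weighted(p_cn, p_en), tag)
concat = predict_moves(fuse_concat(p_cn, p_en), tag)
print(chinese_only, weighted, concat)
```

In this toy input the second sentence is a borderline case for the Chinese classifier alone, and the more confident English prediction changes the fused label. That is the kind of effect output-layer fusion is meant to exploit; the size of any gain on real data depends on the weights α and β and on the models used.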

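Table 2 Number of papers per journal

Table 3 Unified scheme for move labels

Fig.2 Distribution of sentence length in Chinese and English abstracts

Fig.3 Average number of sentences per abstract (Chinese)

Table 4 Experimental results of domain-data-enhanced pre-training

Fig.4 Performance of the pre-trained models

Table 5 Experimental results of domain-data-enhanced pre-training (Data Analysis and Knowledge Discovery data only)

Table 6 Datasets for the bilingual data enhancement experiments with cross-lingual models

Table 7 Results of the bilingual data enhancement experiments with cross-lingual models (F1)

Table 8 Results of the bilingual data enhancement experiments with cross-lingual models (F1) (Data Analysis and Knowledge Discovery data only)

Table 9 Best performance of the cross-lingual models compared with the baseline models (F1)

Table 10 Best performance of the cross-lingual models compared with the baseline models (F1) (Data Analysis and Knowledge Discovery data only)

Table 11 Probability weights for the weighted method

Fig.5 Results of the bilingual probability fusion experiments: Accuracy (Data Analysis and Knowledge Discovery data only)

Fig.6 Results of the bilingual probability fusion experiments: Macro Avg (F1) (Data Analysis and Knowledge Discovery data only)

Fig.7 Results of the bilingual probability fusion experiments: Weighted Avg (F1) (Data Analysis and Knowledge Discovery data only)

Fig.8 Prediction results at best performance (F1) (Data Analysis and Knowledge Discovery data only)

Table 12 Fuzzy recognition performance (F1) (Data Analysis and Knowledge Discovery data only)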
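
Several of the captions above report Accuracy, Macro Avg (F1), and Weighted Avg (F1). The sketch below shows the standard definitions of these three metrics so the difference between them is concrete. The labels and predictions are invented toy data, not results from the paper, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def per_class_f1(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """F1 for each label, computed as 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Share of sentences whose predicted move equals the gold move."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1: every move type counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Mean of per-class F1 weighted by each class's number of gold sentences."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy data: a rare class ("limitations") that is recognized less reliably.
y_true = ["methods"] * 4 + ["results"] * 4 + ["limitations"] * 2
y_pred = (["methods"] * 4
          + ["results", "results", "results", "methods"]
          + ["limitations", "results"])

print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred), weighted_f1(y_true, y_pred))
```

Because the rare class has the lowest per-class F1 in this toy example, the macro average comes out below the weighted average. This is why the two are usually reported side by side when move types are unevenly distributed.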

