Data Analysis and Knowledge Discovery, 2024, Vol. 8, Issue (6): 44-55     https://doi.org/10.11925/infotech.2096-3467.2023.0448
Research Paper
Identifying Structural Function of Scientific Literature Abstracts Based on Deep Active Learning
Mao Jin1,2(),Chen Ziyang1,2
1Center for Studies of Information Resources, Wuhan University, Wuhan 430072, China
2School of Information Management, Wuhan University, Wuhan 430072, China
Abstract

[Objective] This paper explores the effectiveness and labeling costs of different deep active learning (DeepAL) methods for identifying the structural function of scientific literature abstracts. [Methods] First, we constructed a SciBERT-BiLSTM-CRF model for abstracts (SBCA) that utilizes the contextual sequence information between sentences. Then, we developed uncertainty-based active learning strategies at two levels: single sentences and full abstracts. Finally, we conducted experiments on the PubMed 20K dataset. [Results] The SBCA model achieved the best recognition performance, improving the F1 value by 11.93 percentage points over the SciBERT model without sequence information. With the Least Confidence strategy based on full abstracts, the SBCA model reached its optimal F1 value using only 60% of the experimental data; with the Least Confidence strategy based on single sentences, it did so with 65% of the data. [Limitations] Only uncertainty-based query strategies were constructed; other categories of query strategies were not considered. [Conclusions] Methods based on deep active learning can identify the structural function of scientific literature abstracts at a lower annotation cost.
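The two Least Confidence query strategies described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the pool, the class probabilities, and the aggregation of sentence confidences into an abstract-level confidence (a product of per-sentence maxima) are all assumed stand-ins; in the paper the SBCA model's CRF scores the whole label sequence.

```python
# Toy unlabeled pool: each "abstract" is a list of per-sentence
# probability distributions over the five structural functions.
# All numbers are hypothetical stand-ins for model outputs.
pool = {
    "abs_a": [[0.90, 0.05, 0.03, 0.01, 0.01],
              [0.80, 0.10, 0.05, 0.03, 0.02]],
    "abs_b": [[0.40, 0.30, 0.20, 0.05, 0.05],
              [0.50, 0.30, 0.10, 0.05, 0.05]],
}

def lc_sentence(dist):
    """Sentence-level least confidence: 1 - max class probability."""
    return 1.0 - max(dist)

def lc_abstract(sent_dists):
    """Abstract-level least confidence: 1 - confidence of the whole
    predicted sequence (here: product of per-sentence max probs)."""
    conf = 1.0
    for dist in sent_dists:
        conf *= max(dist)
    return 1.0 - conf

def query(pool, k=1):
    """Pick the k most uncertain abstracts to send to the annotator."""
    ranked = sorted(pool, key=lambda a: lc_abstract(pool[a]), reverse=True)
    return ranked[:k]

print(query(pool, k=1))  # -> ['abs_b']
```

Each active learning cycle then moves the queried abstracts from the unlabeled pool to the labeled set, retrains the model, and repeats.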

Key words: Deep Learning; Document Structural Function Identification; Move; Active Learning; Knowledge Organization
Received: 2023-05-12      Online: 2024-01-08
CLC Number: G35
Fund: *National Natural Science Foundation of China (72174154); Major Program of the Key Research Institute of Humanities and Social Sciences at Universities (22JJD870005)
Corresponding author: Mao Jin, ORCID: 0000-0001-9572-6709, E-mail: danveno@163.com.
Cite this article:
Mao Jin, Chen Ziyang. Identifying Structural Function of Scientific Literature Abstracts Based on Deep Active Learning. Data Analysis and Knowledge Discovery, 2024, 8(6): 44-55.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2023.0448      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2024/V8/I6/44
Researchers | Word embedding layer | Sentence embedding layer | Semantic enrichment layer | Output layer | Results
Dernoncourt et al.[32] | Character Emb. + GloVe | Bi-LSTM | - | CRF | F1 of 89.9% on PubMed 20K
Jin et al.[31] | Bio Word2Vec | Attention-pooled Bi-LSTM | Bi-LSTM | CRF | F1 of 92.6% on PubMed 20K
Cohan et al.[33] | SciBERT | SciBERT-[SEP] | SciBERT-[SEP] | Softmax | F1 of 92.9% on PubMed 20K
Brack et al.[34] | SciBERT | Attention-pooled Bi-LSTM | Bi-LSTM | CRF | F1 of 92.9% on PubMed 20K
Shang et al.[35] | BERT | Bi-LSTM | SDLA | CRF | F1 of 92.8% on PubMed 20K
Table 1  Comparison of representative sequence labeling studies
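Most rows in Table 1 end in a CRF output layer. What the CRF adds at prediction time is a Viterbi decode over the whole abstract: the label chosen for each sentence depends on label-transition scores as well as that sentence's own scores. A minimal sketch of that decoding step follows; the emission and transition values are illustrative stand-ins, not the paper's learned parameters.

```python
LABELS = ["Background", "Objective", "Method", "Result", "Conclusion"]

def viterbi(emissions, transitions):
    """emissions[t][j]: score of label j for sentence t (from the encoder);
    transitions[i][j]: score of moving from label i to label j."""
    n = len(LABELS)
    score = list(emissions[0])   # best score of a path ending in each label
    back = []                    # backpointers per step
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            ptr.append(best_i)
        score = new_score
        back.append(ptr)
    j = max(range(n), key=lambda i: score[i])
    path = [j]
    for ptr in reversed(back):   # follow backpointers to recover the path
        j = ptr[j]
        path.append(j)
    return [LABELS[i] for i in reversed(path)]

# Illustrative transitions: favor staying in the same move or advancing one
# move forward (Background -> Objective -> Method -> ...), penalize jumps.
T = [[0.5 if j == i else (1.0 if j == i + 1 else -2.0) for j in range(5)]
     for i in range(5)]
E = [[3, 0, 0, 0, 0],        # clearly Background
     [0, 2, 0, 0, 0],        # clearly Objective
     [0, 0, 0.5, 0, 2.2]]    # emission prefers Conclusion, but the
                             # Objective -> Method transition wins
print(viterbi(E, T))  # -> ['Background', 'Objective', 'Method']
```

The third sentence shows the point of the CRF: taken alone its scores favor Conclusion, but the transition scores make the sequence-level best path label it Method.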
Fig.1  Prediction process of the active learning based structural function identification model
Fig.2  Length distribution of single sentences in abstracts
Fig.3  Overall experimental workflow
Model | P(%) | R(%) | F1(%)
SciBERT | 80.29 | 80.76 | 80.66
SBCS | 87.35 | 87.14 | 87.07
SBCA | 92.26 | 92.64 | 92.59
Table 2  Experimental results of the structural function identification models
Fig.4  F1 values of each active learning strategy over the first 20 cycles
Cycle SBCA-Random(%) SBCA-LC-abs(%) SBCA-LC-sen(%)
1 90.790 90.312 90.968
2 91.058 91.235 91.459
3 91.324 91.958 91.621
4 91.409 92.066 91.841
5 91.511 92.172 92.070
6 91.664 92.097 92.098
7 91.673 92.325 92.261
8 91.867 92.351 92.233
9 91.862 92.444 92.378
10 91.915 92.571 92.484
11 91.938 92.546 92.457
12 92.028 92.601 92.568
13 92.046 92.582 92.629
14 92.192 92.594 92.583
15 92.116 92.623 92.641
16 92.093 92.749 92.781
17 92.241 92.654 92.826
18 92.207 92.707 92.972
19 92.296 92.738 93.011
20 92.354 92.727 92.945
Table 3  F1 values of each active learning strategy in the first 20 cycles
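Reading Table 3 against the full-data SBCA F1 of 92.59% from Table 2 recovers the data-saving figures reported in the abstract, on the assumption (consistent with those figures) that each cycle adds 5% of the training pool, so cycle 12 corresponds to 60% and cycle 13 to 65% of the data:

```python
# Table 3's F1 curves (first 20 cycles) and the full-data SBCA F1.
FULL_F1 = 92.59  # SBCA trained on all data (Table 2)
curves = {
    "SBCA-Random": [90.790, 91.058, 91.324, 91.409, 91.511, 91.664,
                    91.673, 91.867, 91.862, 91.915, 91.938, 92.028,
                    92.046, 92.192, 92.116, 92.093, 92.241, 92.207,
                    92.296, 92.354],
    "SBCA-LC-abs": [90.312, 91.235, 91.958, 92.066, 92.172, 92.097,
                    92.325, 92.351, 92.444, 92.571, 92.546, 92.601,
                    92.582, 92.594, 92.623, 92.749, 92.654, 92.707,
                    92.738, 92.727],
    "SBCA-LC-sen": [90.968, 91.459, 91.621, 91.841, 92.070, 92.098,
                    92.261, 92.233, 92.378, 92.484, 92.457, 92.568,
                    92.629, 92.583, 92.641, 92.781, 92.826, 92.972,
                    93.011, 92.945],
}

def first_cycle_reaching(f1s, target):
    """First cycle whose F1 meets or exceeds the target, else None."""
    for cycle, f1 in enumerate(f1s, start=1):
        if f1 >= target:
            return cycle
    return None

print(first_cycle_reaching(curves["SBCA-LC-abs"], FULL_F1))  # -> 12 (= 60% of data)
print(first_cycle_reaching(curves["SBCA-LC-sen"], FULL_F1))  # -> 13 (= 65% of data)
print(first_cycle_reaching(curves["SBCA-Random"], FULL_F1))  # -> None within 20 cycles
```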
Structural function | SBCA (P/R/F1, %) | SBCA-Random (P/R/F1, %) | SBCA-LC-abs (P/R/F1, %) | SBCA-LC-sen (P/R/F1, %)
Background | 73.59 / 81.50 / 77.35 | 72.75 / 84.00 / 77.97 | 79.92 / 79.82 / 79.87 | 81.76 / 85.89 / 83.78
Objective | 73.93 / 61.54 / 67.17 | 71.37 / 68.94 / 70.13 | 75.16 / 72.77 / 73.95 | 81.47 / 74.36 / 77.75
Method | 94.41 / 97.52 / 95.94 | 94.15 / 97.20 / 95.65 | 94.98 / 97.65 / 96.30 | 94.45 / 97.33 / 95.87
Result | 96.61 / 94.34 / 95.47 | 96.36 / 94.33 / 95.34 | 96.89 / 95.14 / 96.01 | 96.60 / 94.33 / 95.45
Conclusion | 95.53 / 94.88 / 95.21 | 95.91 / 95.13 / 95.52 | 96.31 / 95.81 / 96.06 | 95.79 / 95.32 / 95.55
Table 4  Best identification results of each model
True label \ Predicted | Background | Objective | Method | Result | Conclusion
Background | 1,954 | 421 | 62 | 1 | 10
Objective | 438 | 1,368 | 65 | 6 | 3
Method | 45 | 29 | 7,782 | 99 | 14
Result | 3 | 1 | 269 | 7,452 | 108
Conclusion | 5 | 1 | 15 | 133 | 3,519
Table 5  Confusion matrix of the SBCA-LC-abs model
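Table 4's per-class scores for the SBCA-LC-abs model can be recomputed directly from this confusion matrix, which also confirms that rows are true labels and columns are predicted labels:

```python
# Table 5's confusion matrix for SBCA-LC-abs (rows = true labels,
# columns = predicted labels, in the order of `labels`).
labels = ["Background", "Objective", "Method", "Result", "Conclusion"]
cm = [
    [1954,  421,   62,    1,   10],
    [ 438, 1368,   65,    6,    3],
    [  45,   29, 7782,   99,   14],
    [   3,    1,  269, 7452,  108],
    [   5,    1,   15,  133, 3519],
]

def prf(cm, i):
    """Precision, recall, and F1 (in %) for class i."""
    tp = cm[i][i]
    predicted = sum(row[i] for row in cm)   # column sum: predicted as i
    actual = sum(cm[i])                     # row sum: truly i
    p, r = tp / predicted, tp / actual
    f1 = 2 * p * r / (p + r)
    return round(p * 100, 2), round(r * 100, 2), round(f1 * 100, 2)

print(prf(cm, 0))  # -> (79.92, 79.82, 79.87), Table 4's Background row
print(prf(cm, 2))  # -> (94.98, 97.65, 96.3), Table 4's Method row
```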
True label \ Predicted | Background | Objective | Method | Result | Conclusion
Background | 2,094 | 279 | 52 | 2 | 11
Objective | 409 | 1,398 | 67 | 3 | 3
Method | 50 | 38 | 7,756 | 106 | 19
Result | 5 | 1 | 317 | 7,389 | 121
Conclusion | 3 | 0 | 20 | 149 | 3,501
Table 6  Confusion matrix of the SBCA-LC-sen model
[1] Chen Yuetong, Wang Hao, Li Yueyan, et al. An Academic Articles Evaluation Method Oriented to Content Differentiation[J]. Journal of Information Resources Management, 2022, 12(4): 56-69. doi: 10.13365/j.jirm.2022.04.056
[2] Balcan M F, Beygelzimer A, Langford J. Agnostic Active Learning[J]. Journal of Computer and System Sciences, 2009, 75(1): 78-89.
[3] Settles B. Active Learning Literature Survey[R]. Madison: University of Wisconsin-Madison, Department of Computer Sciences, 2009.
[4] Yu K, Zhu S H, Xu W, et al. Non-Greedy Active Learning for Text Categorization Using Convex Transductive Experimental Design[C]// Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2008: 635-642.
[5] Figueroa R L, Zeng-Treitler Q, Ngo L H, et al. Active Learning for Clinical Text Classification: Is It Better than Random Sampling?[J]. Journal of the American Medical Informatics Association, 2012, 19(5): 809-816.
doi: 10.1136/amiajnl-2011-000648 pmid: 22707743
[6] Schröder C, Niekler A. A Survey of Active Learning for Text Classification Using Deep Neural Networks[OL]. arXiv Preprint, arXiv: 2008.07267.
[7] Tong S, Koller D. Support Vector Machine Active Learning with Applications to Text Classification[C]// Proceedings of the 17th International Conference on Machine Learning. ACM, 2000: 999-1006.
[8] Duan Youxiang, Zhang Xiaotian. Research on SVM Review Content Classification Algorithm Based on Active Learning[J]. Computer & Digital Engineering, 2022, 50(3): 608-612.
[9] Shen Y Y, Yun H, Lipton Z C, et al. Deep Active Learning for Named Entity Recognition[OL]. arXiv Preprint, arXiv: 1707.05928.
[10] Haffari G, Sarkar A. Active Learning for Multilingual Statistical Machine Translation[C]// Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. 2009: 181-189.
[11] Howard J, Ruder S. Universal Language Model Fine-Tuning for Text Classification[C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 2018: 328-339.
[12] Zhang Y, Lease M, Wallace B. Active Discriminative Text Representation Learning[C]// Proceedings of the 31st AAAI Conference on Artificial Intelligence. 2017: 3386-3392.
[13] Hu R, Mac Namee B, Delany S J. Active Learning for Text Classification with Reusability[J]. Expert Systems with Applications, 2016, 45: 438-449.
[14] Lu J H, Mac Namee B. Investigating the Effectiveness of Representations Based on Pretrained Transformer-Based Language Models in Active Learning for Labelling Text Datasets[OL]. arXiv Preprint, arXiv: 2004.13138.
[15] Chen Y K, Lasko T A, Mei Q Z, et al. A Study of Active Learning Methods for Named Entity Recognition in Clinical Text[J]. Journal of Biomedical Informatics, 2015, 58: 11-18.
pmid: 26385377
[16] Shi Jiaoxiang, Zhu Lijun, Wei Chao, et al. FinTech Named Entity Recognition Based on Transfer Learning and Active Learning[J]. China Science & Technology Resources Review, 2022, 54(2): 35-45.
[17] Jing Shenqi, Zhao Youlin. Recognizing Clinical Named Entity from Chinese Electronic Medical Record Texts Based on Semi-Supervised Deep Learning[J]. Journal of Information Resources Management, 2021, 11(6): 105-115. doi: 10.13365/j.jirm.2021.06.105
[18] Wang Mo, Cui Yunpeng, Chen Li, et al. A Deep Learning-Based Method of Argumentative Zoning for Research Articles[J]. Data Analysis and Knowledge Discovery, 2020, 4(6): 60-68.
[19] Ding Liangping, Zhang Zhixiong, Liu Huan. Factors Affecting Rhetorical Move Recognition with SVM Model[J]. Data Analysis and Knowledge Discovery, 2019, 3(11): 16-23.
[20] Teufel S, Moens M. Summarizing Scientific Articles: Experiments with Relevance and Rhetorical Status[J]. Computational Linguistics, 2002, 28(4): 409-445.
[21] Swales J M. Genre Analysis: English in Academic and Research Settings[M]. Cambridge: Cambridge University Press, 1990.
[22] Cao Yan, Mu Aipeng. The Characteristics of Academic Words Across Different Abstract Moves of English Scientific and Technical Journals[J]. Foreign Language Research, 2011(3): 46-49.
[23] Song Donghuan, Li Chenying, Liu Ziyu, et al. Semantic Feature Dictionary Construction of Abstract in English Scientific Journals[J]. Library and Information Service, 2020, 64(6): 108-119. doi: 10.13266/j.issn.0252-3116.2020.06.013
[24] Shen Si, Hu Haotian, Ye Wenhao, et al. Research on Abstract Structure Function Automatic Recognition Based on Full Character Semantics[J]. Journal of the China Society for Scientific and Technical Information, 2019, 38(1): 79-88.
[25] Li Xiangdong, Sun Qianru, Shi Jian. Automatic Classification of Product Review Texts Combining Short Text Extension and BERT[J]. Journal of Information Resources Management, 2023, 13(1): 129-139. doi: 10.13365/j.jirm.2023.01.129
[26] Zhang Zhixiong, Liu Huan, Ding Liangping, et al. Identifying Moves of Research Abstracts with Deep Learning Methods[J]. Data Analysis and Knowledge Discovery, 2019, 3(12): 1-9.
[27] Beltagy I, Lo K, Cohan A. SciBERT: A Pretrained Language Model for Scientific Text[OL]. arXiv Preprint, arXiv: 1903.10676.
[28] Dernoncourt F, Lee J Y. PubMed 200k RCT: A Dataset for Sequential Sentence Classification in Medical Abstracts[C]// Proceedings of the 8th International Joint Conference on Natural Language Processing. 2017: 308-313.
[29] Zhao Yang, Zhang Zhixiong, Liu Huan, et al. Design and Implementation of the Move Recognition System for Fund Project Abstract[J]. Information Studies: Theory & Application, 2022, 45(8): 162-168.
[30] Zhao Yang, Zhang Zhixiong, Li Jie. The Construction of Move Recognition Corpus for Project Application Abstract[J]. Library and Information Service, 2022, 66(21): 97-106. doi: 10.13266/j.issn.0252-3116.2022.21.011
[31] Jin D, Szolovits P. Hierarchical Neural Networks for Sequential Sentence Classification in Medical Scientific Abstracts[C]// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018: 3100-3109.
[32] Dernoncourt F, Lee J Y, Szolovits P. Neural Networks for Joint Sentence Classification in Medical Paper Abstracts[OL]. arXiv Preprint, arXiv: 1612.05251.
[33] Cohan A, Beltagy I, King D, et al. Pretrained Language Models for Sequential Sentence Classification[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. 2019: 3693-3699.
[34] Brack A, Entrup E, Stamatakis M, et al. Sequential Sentence Classification in Research Papers Using Cross-Domain Multi-Task Learning[OL]. arXiv Preprint, arXiv: 2102.06008.
[35] Shang X C, Ma Q L, Lin Z X, et al. A Span-Based Dynamic Local Attention Model for Sequential Sentence Classification[C]// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. 2021: 198-203.
[36] Devlin J, Chang M W, Lee K, et al. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding[OL]. arXiv Preprint, arXiv: 1810.04805.
[37] Hochreiter S, Schmidhuber J. Long Short-Term Memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
doi: 10.1162/neco.1997.9.8.1735 pmid: 9377276
[38] Lafferty J, McCallum A, Pereira F. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data[C]// Proceedings of the 18th International Conference on Machine Learning. 2001: 282-289.
[39] Lewis D D, Gale W A. A Sequential Algorithm for Training Text Classifiers[C]// Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Berlin: Springer, 1994: 3-12.
[40] Gotmare A, Keskar N S, Xiong C, et al. A Closer Look at Deep Learning Heuristics: Learning Rate Restarts, Warmup and Distillation[OL]. arXiv Preprint, arXiv: 1810.13243.