Data Analysis and Knowledge Discovery  2024, Vol. 8 Issue (6): 44-55
Identifying Structural Function of Scientific Literature Abstracts Based on Deep Active Learning
Mao Jin1,2, Chen Ziyang1,2
1Center for Studies of Information Resources, Wuhan University, Wuhan 430072, China
2School of Information Management, Wuhan University, Wuhan 430072, China
[Objective] This paper examines how different deep active learning (DeepAL) methods perform in identifying the structural function of scientific literature abstracts, and what labeling cost they incur. [Methods] We propose a method for identifying the structural function of abstracts that combines active learning with sequence labeling. First, we construct a SciBERT-BiLSTM-CRF model for abstracts (SBCA) that uses the contextual sequence information between sentences. Then, we propose uncertainty-based active learning strategies at two levels: the single sentence and the full abstract. Finally, we conduct experiments on the PubMed 20K dataset. [Results] The SBCA model achieved the best recognition performance, raising the F1 value by 11.93 percentage points over a SciBERT model that does not use sequence information. With the Least Confidence strategy based on whole abstracts, the optimal F1 value of the SBCA model was reached using only 60% of the data; with the Least Confidence strategy based on single sentences, it was reached using 65% of the data. [Limitations] This study constructed only uncertainty-based active learning query strategies and did not consider other categories of query strategy. Future work should also examine different active learning strategies in more fields or on multi-language datasets. [Conclusions] Methods based on deep active learning help identify the structural function of abstracts at a lower annotation cost.
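The two Least Confidence strategies named in the abstract can be illustrated with a minimal sketch. It assumes the model exposes per-sentence class probabilities, scores a sentence as 1 minus its highest class probability, and scores an abstract as the mean of its sentence scores. The abstract does not say how sentence scores are aggregated per abstract, so the mean is an assumption, and the function names and the toy `pool` are hypothetical.

```python
from statistics import mean

def least_confidence(probs):
    """Uncertainty of one sentence: 1 minus its highest class probability."""
    return 1.0 - max(probs)

def rank_sentences(sentence_probs, k):
    """Indices of the k most uncertain sentences (sentence-level query)."""
    scores = [least_confidence(p) for p in sentence_probs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

def rank_abstracts(abstract_probs, k):
    """Indices of the k most uncertain abstracts (abstract-level query).

    Assumption: an abstract's score is the mean of its sentence scores.
    """
    scores = [mean(least_confidence(p) for p in sents) for sents in abstract_probs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

# Toy unlabeled pool: two abstracts, probabilities over 5 structural functions.
pool = [
    [[0.90, 0.05, 0.03, 0.01, 0.01],
     [0.80, 0.10, 0.05, 0.03, 0.02]],
    [[0.40, 0.35, 0.15, 0.05, 0.05],
     [0.50, 0.30, 0.10, 0.05, 0.05],
     [0.95, 0.02, 0.01, 0.01, 0.01]],
]
all_sentences = [sent for abstract in pool for sent in abstract]
top_abstract = rank_abstracts(pool, k=1)
top_sentences = rank_sentences(all_sentences, k=2)
```

In this toy pool the second abstract is selected at the abstract level, and its first two sentences are selected at the sentence level. In an active learning loop, the selected items would be sent for annotation and added to the training set before the next cycle.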

Key words: Deep Learning; Document Structural Function Identification; Move; Active Learning; Knowledge Organization
Received: 2023-05-12      Published: 2024-01-08
CLC number:  G35
Corresponding author: Mao Jin, ORCID: 0000-0001-9572-6709
Mao Jin, Chen Ziyang. Identifying Structural Function of Scientific Literature Abstracts Based on Deep Active Learning. Data Analysis and Knowledge Discovery, 2024, 8(6): 44-55.
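| Researchers | Word embedding layer | Sentence embedding layer | Semantic enrichment layer | Output layer | Result |
| --- | --- | --- | --- | --- | --- |
| Dernoncourt et al. [32] | Character Emb. + GloVe | Bi-LSTM | - | CRF | F1 of 89.9% on PubMed 20K |
| Jin et al. [31] | Bio Word2Vec | Attention-pooling-based Bi-LSTM | | CRF | F1 of 92.6% on PubMed 20K |
| Cohan et al. [33] | SciBERT | SciBERT-[SEP] | SciBERT-[SEP] | Softmax | F1 of 92.9% on PubMed 20K |
| Brack et al. [34] | SciBERT | Attention-pooling-based Bi-LSTM | | CRF | F1 of 92.9% on PubMed 20K |
| Shang et al. [35] | BERT | Bi-LSTM | SDLA | CRF | F1 of 92.8% on PubMed 20K |

Table 1  Comparison of representative sequence labeling studies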
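Fig.1  Prediction process of the structural function identification model based on active learning
Fig.2  Distribution of single-sentence length in abstracts
Fig.3  Overall experimental workflow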
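| Model | P (%) | R (%) | F1 (%) |
| --- | --- | --- | --- |
| SciBERT | 80.29 | 80.76 | 80.66 |
| SBCS | 87.35 | 87.14 | 87.07 |
| SBCA | 92.26 | 92.64 | 92.59 |

Table 2  Experimental results of the structural function identification models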
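The gain of SBCA over SciBERT is reported as 11.93 percentage points, which is an absolute difference in F1 and not a relative improvement. The short sketch below recomputes both quantities from the F1 values in Table 2. The helper name `gain` is hypothetical.

```python
# F1 values (%) copied from Table 2.
F1 = {"SciBERT": 80.66, "SBCS": 87.07, "SBCA": 92.59}

def gain(baseline, model):
    """Return (absolute gain in percentage points, relative gain in %)."""
    absolute = F1[model] - F1[baseline]
    relative = 100.0 * absolute / F1[baseline]
    return round(absolute, 2), round(relative, 2)

points, relative = gain("SciBERT", "SBCA")
```

Here `points` is the 11.93 percentage-point difference, while `relative` is the larger figure obtained when the same difference is expressed as a share of the SciBERT baseline.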
Fig.4  Line chart of the F1 values of each active learning strategy over the first 20 cycles
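| Cycle | SBCA-Random (%) | SBCA-LC-abs (%) | SBCA-LC-sen (%) |
| --- | --- | --- | --- |
| 1 | 90.790 | 90.312 | 90.968 |
| 2 | 91.058 | 91.235 | 91.459 |
| 3 | 91.324 | 91.958 | 91.621 |
| 4 | 91.409 | 92.066 | 91.841 |
| 5 | 91.511 | 92.172 | 92.070 |
| 6 | 91.664 | 92.097 | 92.098 |
| 7 | 91.673 | 92.325 | 92.261 |
| 8 | 91.867 | 92.351 | 92.233 |
| 9 | 91.862 | 92.444 | 92.378 |
| 10 | 91.915 | 92.571 | 92.484 |
| 11 | 91.938 | 92.546 | 92.457 |
| 12 | 92.028 | 92.601 | 92.568 |
| 13 | 92.046 | 92.582 | 92.629 |
| 14 | 92.192 | 92.594 | 92.583 |
| 15 | 92.116 | 92.623 | 92.641 |
| 16 | 92.093 | 92.749 | 92.781 |
| 17 | 92.241 | 92.654 | 92.826 |
| 18 | 92.207 | 92.707 | 92.972 |
| 19 | 92.296 | 92.738 | 93.011 |
| 20 | 92.354 | 92.727 | 92.945 |

Table 3  F1 values of each active learning strategy over the first 20 cycles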
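The 60% and 65% data figures in the abstract can be related to Table 3 with a small sketch that finds the first cycle at which each strategy reaches the SBCA F1 value from Table 2 (92.59%). The mapping from cycle to data share assumes each cycle adds 5% of the data. That step size is not stated on this page; it is an assumption chosen because it is consistent with the reported figures. The function and constant names are hypothetical.

```python
TARGET_F1 = 92.59        # SBCA F1 (%) from Table 2
SHARE_PER_CYCLE = 5      # assumed: each cycle adds 5% of the data

# F1 values (%) per cycle, copied from Table 3.
F1_BY_CYCLE = {
    "SBCA-Random": [90.790, 91.058, 91.324, 91.409, 91.511, 91.664, 91.673,
                    91.867, 91.862, 91.915, 91.938, 92.028, 92.046, 92.192,
                    92.116, 92.093, 92.241, 92.207, 92.296, 92.354],
    "SBCA-LC-abs": [90.312, 91.235, 91.958, 92.066, 92.172, 92.097, 92.325,
                    92.351, 92.444, 92.571, 92.546, 92.601, 92.582, 92.594,
                    92.623, 92.749, 92.654, 92.707, 92.738, 92.727],
    "SBCA-LC-sen": [90.968, 91.459, 91.621, 91.841, 92.070, 92.098, 92.261,
                    92.233, 92.378, 92.484, 92.457, 92.568, 92.629, 92.583,
                    92.641, 92.781, 92.826, 92.972, 93.011, 92.945],
}

def first_cycle_reaching(scores, target):
    """Return the first 1-based cycle whose F1 meets the target, else None."""
    for cycle, score in enumerate(scores, start=1):
        if score >= target:
            return cycle
    return None

for name, scores in F1_BY_CYCLE.items():
    cycle = first_cycle_reaching(scores, TARGET_F1)
    share = None if cycle is None else cycle * SHARE_PER_CYCLE
    print(name, cycle, share)
```

Under the 5% assumption, SBCA-LC-abs first reaches the target at cycle 12 and SBCA-LC-sen at cycle 13, while random sampling does not reach it within the 20 cycles listed.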
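| Structural function | SBCA P (%) | SBCA R (%) | SBCA F1 (%) | SBCA-Random P (%) | SBCA-Random R (%) | SBCA-Random F1 (%) | SBCA-LC-abs P (%) | SBCA-LC-abs R (%) | SBCA-LC-abs F1 (%) | SBCA-LC-sen P (%) | SBCA-LC-sen R (%) | SBCA-LC-sen F1 (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Background | 73.59 | 81.50 | 77.35 | 72.75 | 84.00 | 77.97 | 79.92 | 79.82 | 79.87 | 81.76 | 85.89 | 83.78 |
| Objective | 73.93 | 61.54 | 67.17 | 71.37 | 68.94 | 70.13 | 75.16 | 72.77 | 73.95 | 81.47 | 74.36 | 77.75 |
| Method | 94.41 | 97.52 | 95.94 | 94.15 | 97.20 | 95.65 | 94.98 | 97.65 | 96.30 | 94.45 | 97.33 | 95.87 |
| Result | 96.61 | 94.34 | 95.47 | 96.36 | 94.33 | 95.34 | 96.89 | 95.14 | 96.01 | 96.60 | 94.33 | 95.45 |
| Conclusion | 95.53 | 94.88 | 95.21 | 95.91 | 95.13 | 95.52 | 96.31 | 95.81 | 96.06 | 95.79 | 95.32 | 95.55 |

Table 4  Best identification results of the models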
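| Structural function | Background | Objective | Method | Result | Conclusion |
| --- | --- | --- | --- | --- | --- |
| Background | 1954 | 421 | 62 | 1 | 10 |
| Objective | 438 | 1368 | 65 | 6 | 3 |
| Method | 45 | 29 | 7782 | 99 | 14 |
| Result | 3 | 1 | 269 | 7452 | 108 |
| Conclusion | 5 | 1 | 15 | 133 | 3519 |

Table 5  Confusion matrix of the SBCA-LC-abs model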
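The per-class precision, recall and F1 values reported for SBCA-LC-abs in Table 4 can be recomputed from the Table 5 confusion matrix. The sketch below assumes that rows are true labels and columns are predicted labels. The page does not state the orientation, but this reading matches the Table 4 values. The names `CONFUSION` and `per_class_metrics` are hypothetical.

```python
LABELS = ["Background", "Objective", "Method", "Result", "Conclusion"]

# Counts copied from Table 5 (assumed: rows = true label, columns = predicted).
CONFUSION = [
    [1954, 421, 62, 1, 10],
    [438, 1368, 65, 6, 3],
    [45, 29, 7782, 99, 14],
    [3, 1, 269, 7452, 108],
    [5, 1, 15, 133, 3519],
]

def per_class_metrics(matrix, labels):
    """Return {label: (precision %, recall %, F1 %)} rounded to 2 decimals."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        actual = sum(matrix[i])                    # row total
        predicted = sum(row[i] for row in matrix)  # column total
        precision = 100.0 * true_positive / predicted
        recall = 100.0 * true_positive / actual
        f1 = 2 * precision * recall / (precision + recall)
        metrics[label] = (round(precision, 2), round(recall, 2), round(f1, 2))
    return metrics

metrics = per_class_metrics(CONFUSION, LABELS)
```

The matrix also shows where the errors concentrate: most misclassified Background sentences are predicted as Objective and vice versa, which matches the lower scores of these two classes in Table 4.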
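| Structural function | Background | Objective | Method | Result | Conclusion |
| --- | --- | --- | --- | --- | --- |
| Background | 2094 | 279 | 52 | 2 | 11 |
| Objective | 409 | 1398 | 67 | 3 | 3 |
| Method | 50 | 38 | 7756 | 106 | 19 |
| Result | 5 | 1 | 317 | 7389 | 121 |
| Conclusion | 3 | 0 | 20 | 149 | 3501 |

Table 6  Confusion matrix of the SBCA-LC-sen model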
