Data Analysis and Knowledge Discovery  2024, Vol. 8 Issue (6): 44-55    DOI: 10.11925/infotech.2096-3467.2023.0448
Identifying Structural Function of Scientific Literature Abstracts Based on Deep Active Learning
Mao Jin 1,2, Chen Ziyang 1,2
1 Center for Studies of Information Resources, Wuhan University, Wuhan 430072, China
2 School of Information Management, Wuhan University, Wuhan 430072, China
[Objective] This paper examines how different deep active learning (DeepAL) methods perform in identifying the structural function of scientific literature abstracts, and what labeling cost they incur. [Methods] First, we constructed a SciBERT-BiLSTM-CRF model for abstracts (SBCA), which exploits the contextual sequence information between sentences. Then, we developed uncertainty-based active learning strategies that operate on single sentences and on whole abstracts. Finally, we conducted experiments on the PubMed 20K dataset. [Results] The SBCA model showed the best recognition performance, raising the F1 value by 11.93 percentage points over the SciBERT model without sequence information. With the abstract-level Least Confidence strategy, the SBCA model reached its optimal F1 value using 60% of the experimental data; with the sentence-level Least Confidence strategy, it reached its optimal F1 value using 65% of the experimental data. [Limitations] Different active learning strategies still need to be examined on datasets from more fields and in more languages. [Conclusions] The proposed model based on deep active learning can identify the structural function of scientific literature abstracts at a lower annotation cost.

Keywords: Deep Learning; Document Structural Function Identification; Move; Active Learning; Knowledge Organization
Received: 12 May 2023      Published: 08 January 2024
CLC Number (ZTFLH): G35
Fund: National Natural Science Foundation of China (72174154); Major Projects of Education Ministry's Key Research Base for Humanities and Social Sciences (22JJD870005)
Corresponding Author: Mao Jin, ORCID: 0000-0001-9572-6709

Cite this article:

Mao Jin, Chen Ziyang. Identifying Structural Function of Scientific Literature Abstracts Based on Deep Active Learning. Data Analysis and Knowledge Discovery, 2024, 8(6): 44-55.

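| Researchers | Word embedding layer | Sentence embedding layer | Semantic enrichment layer | Output layer | Result |
|---|---|---|---|---|---|
| Dernoncourt et al. [32] | Character Emb. + GloVe | Bi-LSTM | - | CRF | F1 of 89.9% on the PubMed 20K dataset |
| Jin et al. [31] | Bio Word2Vec | Attention-pooling-based | Bi-LSTM | CRF | F1 of 92.6% on the PubMed 20K dataset |
| Cohan et al. [33] | SciBERT | SciBERT-[SEP] | SciBERT-[SEP] | Softmax | F1 of 92.9% on the PubMed 20K dataset |
| Brack et al. [34] | SciBERT | Attention-pooling-based | Bi-LSTM | CRF | F1 of 92.9% on the PubMed 20K dataset |
| Shang et al. [35] | BERT | Bi-LSTM | SDLA | CRF | F1 of 92.8% on the PubMed 20K dataset |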
Comparison of Representative Studies on Sequence Annotation
Prediction Process of Structural Function Recognition Model Based on Active Learning
Distribution of Sentence Length in Abstract
Experiment Process
| Model | P (%) | R (%) | F1 (%) |
|---|---|---|---|
| SciBERT | 80.29 | 80.76 | 80.66 |
| SBCS | 87.35 | 87.14 | 87.07 |
| SBCA | 92.26 | 92.64 | 92.59 |
Experimental Results of Structural Function Identification Model
Line Chart of F1 Values of Active Learning Strategies in the First 20 Epochs
| Epoch | SBCA-Random (%) | SBCA-LC-abs (%) | SBCA-LC-sen (%) |
|---|---|---|---|
| 1 | 90.790 | 90.312 | 90.968 |
| 2 | 91.058 | 91.235 | 91.459 |
| 3 | 91.324 | 91.958 | 91.621 |
| 4 | 91.409 | 92.066 | 91.841 |
| 5 | 91.511 | 92.172 | 92.070 |
| 6 | 91.664 | 92.097 | 92.098 |
| 7 | 91.673 | 92.325 | 92.261 |
| 8 | 91.867 | 92.351 | 92.233 |
| 9 | 91.862 | 92.444 | 92.378 |
| 10 | 91.915 | 92.571 | 92.484 |
| 11 | 91.938 | 92.546 | 92.457 |
| 12 | 92.028 | 92.601 | 92.568 |
| 13 | 92.046 | 92.582 | 92.629 |
| 14 | 92.192 | 92.594 | 92.583 |
| 15 | 92.116 | 92.623 | 92.641 |
| 16 | 92.093 | 92.749 | 92.781 |
| 17 | 92.241 | 92.654 | 92.826 |
| 18 | 92.207 | 92.707 | 92.972 |
| 19 | 92.296 | 92.738 | 93.011 |
| 20 | 92.354 | 92.727 | 92.945 |
F1 Value of Active Learning Strategies in the First 20 Epochs
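The SBCA-LC-sen and SBCA-LC-abs columns correspond to the Least Confidence strategy applied to single sentences and to whole abstracts. The following is a minimal sketch of that selection step, not the authors' implementation. All function names and the toy probabilities are hypothetical, and scoring an abstract by the mean of its sentence scores is an assumption, since the abstract-level aggregation is not specified here.

```python
from statistics import mean


def least_confidence(probs):
    """Least Confidence score of one sentence: 1 minus the top class probability."""
    return 1.0 - max(probs)


def select_sentences(pool, k):
    """Return the ids of the k most uncertain sentences.

    pool maps a (hypothetical) sentence id to its predicted class probabilities.
    """
    ranked = sorted(pool, key=lambda sid: least_confidence(pool[sid]), reverse=True)
    return ranked[:k]


def select_abstracts(pool, k):
    """Return the ids of the k most uncertain abstracts.

    pool maps a (hypothetical) abstract id to a list of per-sentence probabilities.
    Assumption: an abstract's score is the mean of its sentence scores.
    """
    scores = {
        aid: mean([least_confidence(p) for p in sentences])
        for aid, sentences in pool.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy probabilities over the five structural functions (illustrative only).
sentence_pool = {
    "s1": [0.90, 0.04, 0.03, 0.02, 0.01],
    "s2": [0.40, 0.35, 0.10, 0.10, 0.05],
    "s3": [0.60, 0.20, 0.10, 0.05, 0.05],
}
abstract_pool = {
    "a1": [sentence_pool["s1"], sentence_pool["s3"]],
    "a2": [sentence_pool["s2"], sentence_pool["s3"]],
}

print(select_sentences(sentence_pool, 2))
print(select_abstracts(abstract_pool, 1))
```

In an active learning loop, the selected items would be sent for annotation, added to the training set, and the model retrained before the next epoch. Other aggregations for the abstract-level score (for example, the maximum sentence score) are equally plausible under the same assumptions.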
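| Structural function | SBCA P (%) | SBCA R (%) | SBCA F1 (%) | SBCA-Random P (%) | SBCA-Random R (%) | SBCA-Random F1 (%) | SBCA-LC-abs P (%) | SBCA-LC-abs R (%) | SBCA-LC-abs F1 (%) | SBCA-LC-sen P (%) | SBCA-LC-sen R (%) | SBCA-LC-sen F1 (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Background | 73.59 | 81.50 | 77.35 | 72.75 | 84.00 | 77.97 | 79.92 | 79.82 | 79.87 | 81.76 | 85.89 | 83.78 |
| Objective | 73.93 | 61.54 | 67.17 | 71.37 | 68.94 | 70.13 | 75.16 | 72.77 | 73.95 | 81.47 | 74.36 | 77.75 |
| Method | 94.41 | 97.52 | 95.94 | 94.15 | 97.20 | 95.65 | 94.98 | 97.65 | 96.30 | 94.45 | 97.33 | 95.87 |
| Result | 96.61 | 94.34 | 95.47 | 96.36 | 94.33 | 95.34 | 96.89 | 95.14 | 96.01 | 96.60 | 94.33 | 95.45 |
| Conclusion | 95.53 | 94.88 | 95.21 | 95.91 | 95.13 | 95.52 | 96.31 | 95.81 | 96.06 | 95.79 | 95.32 | 95.55 |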
Best Recognition Results of Model
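| Structural function | Background | Objective | Method | Result | Conclusion |
|---|---|---|---|---|---|
| Background | 1954 | 421 | 62 | 1 | 10 |
| Objective | 438 | 1368 | 65 | 6 | 3 |
| Method | 45 | 29 | 7782 | 99 | 14 |
| Result | 3 | 1 | 269 | 7452 | 108 |
| Conclusion | 5 | 1 | 15 | 133 | 3519 |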
Confusion Matrix of SBCA-LC-abs Model
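The per-class precision, recall, and F1 values can be recomputed from a confusion matrix like the one above. The sketch below assumes that rows are true labels and columns are predicted labels; the source does not state the orientation, but this reading reproduces the reported SBCA-LC-abs figures. The names `LABELS`, `CONFUSION`, and `per_class_metrics` are illustrative.

```python
LABELS = ["Background", "Objective", "Method", "Result", "Conclusion"]

# SBCA-LC-abs confusion matrix; assumed layout: rows = true, columns = predicted.
CONFUSION = [
    [1954, 421, 62, 1, 10],
    [438, 1368, 65, 6, 3],
    [45, 29, 7782, 99, 14],
    [3, 1, 269, 7452, 108],
    [5, 1, 15, 133, 3519],
]


def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall, f1)} in percent for each class."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        actual_total = sum(matrix[i])                      # row sum: true instances
        predicted_total = sum(row[i] for row in matrix)    # column sum: predictions
        precision = true_positive / predicted_total if predicted_total else 0.0
        recall = true_positive / actual_total if actual_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (100 * precision, 100 * recall, 100 * f1)
    return metrics


for label, (p, r, f1) in per_class_metrics(CONFUSION, LABELS).items():
    print(f"{label:<11} P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Under this reading, Background comes out at roughly P = 79.92, R = 79.82, and F1 = 79.87, in line with the SBCA-LC-abs columns of the best-results table. The largest off-diagonal counts are between Background and Objective, which is consistent with those two classes having the lowest scores.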
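| Structural function | Background | Objective | Method | Result | Conclusion |
|---|---|---|---|---|---|
| Background | 2094 | 279 | 52 | 2 | 11 |
| Objective | 409 | 1398 | 67 | 3 | 3 |
| Method | 50 | 38 | 7756 | 106 | 19 |
| Result | 5 | 1 | 317 | 7389 | 121 |
| Conclusion | 3 | 0 | 20 | 149 | 3501 |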
Confusion Matrix of SBCA-LC-sen Model
