Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (10): 134-143    DOI: 10.11925/infotech.2096-3467.2020.0281
Active Learning Strategies for Extracting Phrase-Level Topics from Scientific Literature
Tao Yue1,2, Yu Li1,3, Zhang Runjie4
1National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
3State Key Laboratory of Resources and Environmental Information System, Beijing 100101, China
4Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK
[Objective] This paper explores methods for extracting information from scientific literature with the help of active learning strategies, aiming to address the lack of annotated corpora. [Methods] We built our model from three representative active learning strategies (MARGIN, NSE, MNLP) and one novel strategy (LWP), combined with a neural network model (CNN-BiLSTM-CRF). We then extracted task-related and method-related information from texts using far fewer annotations. [Results] We evaluated the model on scientific articles in which only 10% to 30% of the texts were selectively annotated. The proposed model yielded the same results as models trained on 100% annotated texts, which significantly reduced the labor cost of corpus construction. [Limitations] The number of scientific articles in our sample corpus was small, which led to low precision. [Conclusions] The proposed model significantly reduces reliance on the scale of the annotated corpus. Among the active learning strategies compared, MNLP yielded better results, and its normalization by sentence length improves the model's stability. MARGIN performs well in the initial iterations at identifying low-value instances, while LWP is suitable for datasets with more semantic labels.

Key words: Information Extraction; Active Learning; Neural Network
Received: 03 April 2020      Published: 09 November 2020
ZTFLH:  TP393  
Corresponding Author: Yu Li

Cite this article:

Tao Yue, Yu Li, Zhang Runjie. Active Learning Strategies for Extracting Phrase-Level Topics from Scientific Literature. Data Analysis and Knowledge Discovery, 2020, 4(10): 134-143.

An Information Extraction Framework Combining Neural Network and Active Learning
Workflow of Learning Engine
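The workflow figure itself is not reproduced here, so the following is only a minimal sketch of a generic pool-based active learning loop of the kind the abstract describes: train, score the unlabeled sentences, annotate the most informative batch, and retrain. All names (`active_learning_loop`, `train`, `uncertainty`, `annotate`, `batch_size`, `rounds`) are hypothetical and are not taken from the paper; the paper's actual learning engine may differ.

```python
from typing import Any, Callable, List, Sequence, Tuple


def active_learning_loop(
    pool: Sequence[Any],
    train: Callable[[List[Tuple[Any, Any]]], Any],
    uncertainty: Callable[[Any, Any], float],
    annotate: Callable[[Any], Any],
    batch_size: int,
    rounds: int,
) -> Tuple[Any, List[Tuple[Any, Any]]]:
    """Generic pool-based active learning (illustrative assumption only).

    train(labeled)         -> model   (e.g. fit a sequence tagger)
    uncertainty(model, x)  -> float   (higher means more informative)
    annotate(x)            -> label   (stands in for a human annotator)
    """
    labeled: List[Tuple[Any, Any]] = []
    remaining = list(range(len(pool)))  # indices of still-unlabeled items
    model = train(labeled)
    for _ in range(rounds):
        if not remaining:
            break
        # Rank unlabeled items from most to least uncertain.
        remaining.sort(key=lambda i: uncertainty(model, pool[i]), reverse=True)
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        # Send only the selected batch to the annotator, then retrain.
        labeled.extend((pool[i], annotate(pool[i])) for i in batch)
        model = train(labeled)
    return model, labeled
```

In this sketch the strategy is just the `uncertainty` callable, so swapping one selection strategy for another would not require changing the loop.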
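Strategy | Strengths | Weaknesses | Applicable conditions
MARGIN | Simple formula | Traverses all sequences, so efficiency is low | When annotation difficulty is uneven across the corpus, this strategy can be used to find instances that are hard to annotate
NSE | Considers the N best labelings; low computational complexity | Does not account for the effect of sequence length on sequence value | Suited to sample selection over large volumes of data
MNLP | Normalizes by sequence length; moderately sensitive to long sequences | Performs very well in the initial iterations, but improves little in later stages | Suited to finding samples that may contain special labels in corpora with sparse labels
LWP | Dynamically adjusts the sensitivity to each label class | Tends to select sentences that carry more instance labels | Corpora with multiple label types and unevenly distributed label counts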
Comparative Analysis of Active Learning Strategies
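To make the comparison above more concrete, here is a minimal sketch of three of the scores as they are commonly formulated in the active learning literature. The paper's exact definitions are not shown in this excerpt and may differ. The function names and the inputs (`nbest_probs`, the probabilities of the N best labelings of a sentence, and `token_log_probs`, the per-token log-probabilities of the best labeling) are assumptions for illustration. LWP is omitted because its formula is not given here.

```python
import math
from typing import Sequence


def margin_score(nbest_probs: Sequence[float]) -> float:
    """Gap between the two most probable labelings.

    A small margin means the model can barely separate its top two
    candidates, so the sentence is a good annotation candidate.
    """
    top = sorted(nbest_probs, reverse=True)
    return top[0] - top[1]


def nse_score(nbest_probs: Sequence[float]) -> float:
    """Entropy over the N best labelings, renormalized to sum to 1.

    Higher entropy means more uncertainty. Note that sentence length
    does not enter this score.
    """
    total = sum(nbest_probs)
    return -sum((p / total) * math.log(p / total) for p in nbest_probs if p > 0)


def mnlp_score(token_log_probs: Sequence[float]) -> float:
    """Length-normalized log-probability of the best labeling.

    Dividing by the number of tokens keeps long sentences from being
    selected merely because they are long. Lower means more uncertain.
    """
    return sum(token_log_probs) / len(token_log_probs)
```

Under these assumed definitions, a selector would pick sentences with the smallest `margin_score`, the largest `nse_score`, or the smallest `mnlp_score`.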
Annotation Examples of AI Articles
F1 Scores of the Proposed Framework in Different Fields
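Strategy | Dataset | Corpus sampling percentage
MNLP | AI scientific literature dataset | 30%
NSE | AI scientific literature dataset | 60%
MARGIN | AI scientific literature dataset | 50%
LWP | AI scientific literature dataset | 40%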
The Corpus Sampling Percentage of Different AL Strategies
Dataset | Extracted information | ACCPE
AI | task | 82%
AI | method | 79%
GIS | task | 85%
GIS | method | 81%
ACCPE in Different Fields
Dataset | Extracted information | Best F1
AI | task | 71.47%
AI | method | 70.03%
Best F1 Scores in Different Fields
