Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (10): 134-143    DOI: 10.11925/infotech.2096-3467.2020.0281
Active Learning Strategies for Extracting Phrase-Level Topics from Scientific Literature
Tao Yue1,2, Yu Li1,3, Zhang Runjie4
1National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
3State Key Laboratory of Resources and Environmental Information System, Beijing 100101, China
4Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK
Abstract  

[Objective] This paper explores methods of extracting information from scientific literature with the help of active learning strategies, aiming to address the lack of annotated corpora. [Methods] We constructed our model from three representative active learning strategies (MARGIN, NSE, MNLP) and one novel LWP strategy, combined with a neural network model (CNN-BiLSTM-CRF). We then extracted task- and method-related information from texts with far fewer annotations. [Results] We evaluated the model on scientific articles with only 10%-30% of the texts selectively annotated. The proposed model yielded the same results as models trained on 100% annotated texts, significantly reducing the labor cost of corpus construction. [Limitations] The number of scientific articles in our sample corpus was small, which led to low precision. [Conclusions] The proposed model significantly reduces reliance on the scale of the annotated corpus. Among the examined strategies, MNLP yielded better results and normalizes sentence length to improve the model's stability; MARGIN performs well in the initial iterations at identifying low-value instances; and LWP is suitable for datasets with more semantic labels.
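The iterative sampling-and-retraining procedure the abstract describes can be sketched as follows. This is an illustrative outline only, under common active-learning conventions; the function names (`train`, `score`, `oracle`) are hypothetical placeholders, not the authors' implementation, and in the paper the trained model would be the CNN-BiLSTM-CRF tagger.

```python
def active_learning_loop(unlabeled, oracle, train, score, budget, batch=50):
    """Iteratively train the tagger, score unlabeled sentences with an
    acquisition strategy (e.g. MNLP), and ask the oracle (a human
    annotator) to label the least-confident batch until the annotation
    budget is spent."""
    labeled = []
    model = None
    while budget > 0 and unlabeled:
        model = train(labeled)                       # retrain on labels so far
        # rank sentences by confidence, least confident first
        ranked = sorted(unlabeled, key=lambda s: score(model, s))
        batch_items = ranked[:min(batch, budget)]
        for s in batch_items:
            labeled.append((s, oracle(s)))           # human annotation step
            unlabeled.remove(s)
        budget -= len(batch_items)
    return model, labeled
```

With such a loop, annotation stops once model quality plateaus, which is how only 10%-30% of the corpus ends up hand-labeled.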

Key words: Information Extraction; Active Learning; Neural Network
Received: 03 April 2020      Published: 09 November 2020
ZTFLH:  TP393  
Corresponding Authors: Yu Li     E-mail: yul@mail.las.ac.cn

Cite this article:

Tao Yue,Yu Li,Zhang Runjie. Active Learning Strategies for Extracting Phrase-Level Topics from Scientific Literature. Data Analysis and Knowledge Discovery, 2020, 4(10): 134-143.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.0281     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I10/134

An Information Extraction Framework Combining Neural Network and Active Learning
Workflow of Learning Engine
Strategy | Advantages | Limitations | Applicable Conditions
MARGIN | Simple formula | Traverses all sequences; inefficient | When annotation difficulty is uneven across the corpus, this strategy can find hard-to-annotate instances
NSE | Considers the N best labelings; low computational complexity | Ignores the effect of sequence length on a sequence's value | Suitable for sample selection over large datasets
MNLP | Normalizes for sequence length; moderately sensitive to long sequences | Performs well in the initial iterations, but gains little in later stages | Suitable for finding samples that may contain rare labels in label-sparse corpora
LWP | Dynamically adjusts the sensitivity of each label type | Tends to select sentences carrying more instance labels | Corpora with many label types and an uneven label distribution
Comparative Analysis of Active Learning Strategies
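As a rough illustration of how such acquisition scores are typically computed, the sketch below follows common definitions from the active-learning literature (e.g. MNLP as in Shen et al., reference [21]); it is not the authors' code, and LWP is omitted because its formula is not reproduced on this page.

```python
import numpy as np

def margin_score(best_logprob, second_logprob):
    """MARGIN: gap between the two highest-scoring label sequences;
    a smaller margin means higher model uncertainty."""
    return best_logprob - second_logprob

def mnlp_score(token_logprobs):
    """MNLP (Maximum Normalized Log-Probability): average per-token
    log-probability of the best sequence, normalizing for length so
    long sentences are not penalized."""
    return np.mean(token_logprobs)

def nse_score(nbest_logprobs):
    """N-best Sequence Entropy: entropy over the N best candidate
    labelings, with probabilities renormalized over the N-best list."""
    p = np.exp(nbest_logprobs - np.max(nbest_logprobs))
    p /= p.sum()
    return -np.sum(p * np.log(p))

def select_for_annotation(confidence_scores, k):
    """Pick the k lowest-confidence sentences for human annotation."""
    return np.argsort(confidence_scores)[:k]
```

Under these definitions, MARGIN and MNLP are confidence scores (select the lowest), while NSE is an uncertainty score (select the highest after negation), which matches the trade-offs summarized in the table above.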
Annotation Examples of AI Articles
F1 Scores of the Proposed Framework in Different Fields
Strategy | Dataset | Corpus Sampling Percentage
MNLP | AI sci-tech literature dataset | 30%
MNLP | FTD-FOCUS | 20%
MNLP | FTD-TECHNIQUE | 10%
MNLP | FTD-DOMAIN | 20%
NSE | AI sci-tech literature dataset | 60%
NSE | FTD-FOCUS | 30%
NSE | FTD-TECHNIQUE | 20%
NSE | FTD-DOMAIN | 30%
MARGIN | AI sci-tech literature dataset | 50%
MARGIN | FTD-FOCUS | 60%
MARGIN | FTD-TECHNIQUE | 30%
MARGIN | FTD-DOMAIN | 40%
LWP | AI sci-tech literature dataset | 40%
LWP | FTD-FOCUS | 80%
LWP | FTD-TECHNIQUE | 50%
LWP | FTD-DOMAIN | 60%
The Corpus Sampling Percentage of Different AL Strategies
Dataset | Extracted Information | ACCPE
FTD | FOCUS | 79%
FTD | TECHNIQUE | 73%
FTD | DOMAIN | 82%
AI | task | 82%
AI | method | 79%
GIS | task | 85%
GIS | method | 81%
ACCPE in Different Fields
Dataset | Extracted Information | Best F1
FTD | FOCUS | 55.33%
FTD | TECHNIQUE | 51.33%
FTD | DOMAIN | 57.73%
AI | task | 71.47%
AI | method | 70.03%
Best F1 Scores in Different Fields
[1] Matos P F, Lombardi L O, Pardo T A S, et al. An Environment for Data Analysis in Biomedical Domain: Information Extraction for Decision Support Systems[C]//Proceedings of the 23rd International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems-Volume Part I. 2010: 306-316.
[2] Santos C N, Xiang B, Zhou B W. Classifying Relations by Ranking with Convolutional Neural Networks[OL]. arXiv Preprint, arXiv: 1504.06580.
[3] Hu Z T, Ma X Z, Liu Z Z, et al. Harnessing Deep Neural Networks with Logic Rules[J]. Computing Research Repository, 2016,16(3):2410-2420.
[4] Lample G, Ballesteros M, Subramanian S, et al. Neural Architectures for Named Entity Recognition[C]//Proceedings of NAACL-HLT 2016. 2016:260-270.
[5] Jang Y, Choi H, Deng F, et al. Evaluation of Deep Learning Models for Information Extraction from EMF-Related Literature[C]//Proceedings of the Conference on Research in Adaptive and Convergent Systems. 2019: 113-116.
[6] Gupta S, Manning C D. Analyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers[C]//Proceedings of the 5th International Joint Conference on Natural Language Processing. 2011: 1-9.
[7] 王曰芬, 曹嘉君, 余厚强, 等. 人工智能研究前沿识别与分析:基于领域全局演化研究视角[J]. 情报理论与实践, 2019,42(9):1-7.
[7] ( Wang Yuefen, Cao Jiajun, Yu Houqiang, et al. Identification and Analysis of Research Fronts in Artificial Intelligence: A Perspective Based on Global Evolution Study of the Domain[J]. Information Studies: Theory & Application, 2019,42(9):1-7.)
[8] 张智雄, 刘欢, 丁良萍, 等. 不同深度学习模型的科技论文摘要语步识别效果对比研究[J]. 数据分析与知识发现, 2019,3(12):1-9.
[8] ( Zhang Zhixiong, Liu Huan, Ding Liangping, et al. Identifying Moves of Research Abstracts with Deep Learning Methods[J]. Data Analysis and Knowledge Discovery, 2019,3(12):1-9.)
[9] 丁良萍, 张智雄, 刘欢. 影响支持向量机模型语步自动识别效果的因素研究[J]. 数据分析与知识发现, 2019,3(11):16-23.
[9] ( Ding Liangping, Zhang Zhixiong, Liu Huan. Factors Affecting Rhetorical Move Recognition with SVM Model[J]. Data Analysis and Knowledge Discovery, 2019,3(11):16-23.)
[10] Kulkarni S R, Mitter S K, Tsitsiklis J N, et al. Active Learning Using Arbitrary Binary Valued Queries[J]. Machine Learning, 1993,11:23-35. DOI: 10.1023/A:1022627018023.
[11] Culotta A, McCallum A. Reducing Labeling Effort for Structured Prediction Tasks[C]//Proceedings of the 20th National Conference on Artificial Intelligence. 2005: 746-751.
[12] Sutton C, McCallum A. An Introduction to Conditional Random Fields for Relational Learning[J]. Foundations and Trends in Machine Learning, 2012,4(4):267-373. DOI: 10.1561/2200000013.
[13] Scheffer T, Decomain C, Wrobel S. Active Hidden Markov Models for Information Extraction[C]//Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis. 2001: 309-318.
[14] Deng Y, Bao F, Deng X S, et al. Deep and Structured Robust Information Theoretic Learning for Image Analysis[J]. IEEE Transactions on Image Processing, 2016,25(9):4209-4221. DOI: 10.1109/TIP.2016.2588330. PMID: 27392359.
[15] Vijayanarasimhan S, Grauman K. Active Frame Selection for Label Propagation in Videos[C]//Proceedings of the 12th European Conference on Computer Vision-Volume Part V. 2012: 496-509.
[16] Deng Y, Dai Q H, Liu R S, et al. Low-Rank Structure Learning via Nonconvex Heuristic Recovery[J]. IEEE Transactions on Neural Networks and Learning Systems, 2013,24(3):383-396. DOI: 10.1109/TNNLS.2012.2235082. PMID: 24808312.
[17] Deng Y, Chen K W, Shen Y L, et al. Adversarial Active Learning for Sequences Labeling and Generation[C]//Proceedings of the 27th International Joint Conference on Artificial Intelligence. 2018: 4012-4018.
[18] Ma X Z, Hovy E. End-to-End Sequence Labeling via Bi-directional LSTM-CNNs-CRF[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 2016: 1064-1074.
[19] Dyer C, Ballesteros M, Ling W, et al. Transition-Based Dependency Parsing with Stack Long Short-Term Memory[C]//Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. 2015: 334-343.
[20] Ratinov L, Roth D. Design Challenges and Misconceptions in Named Entity Recognition[C]//Proceedings of the 13th Conference on Computational Natural Language Learning. 2009: 147-155.
[21] Shen Y Y, Yun H K, Lipton Z, et al. Deep Active Learning for Named Entity Recognition[C]//Proceedings of the 2nd Workshop on Representation Learning for NLP. 2017: 252-256.
[22] LeCun Y, Bengio Y. Convolutional Networks for Images, Speech, and Time Series[A]//The Handbook of Brain Theory and Neural Networks[M]. MIT Press, 1998: 255-258.
[23] Balcan M F, Broder A, Zhang T. MARGIN Based Active Learning[C]//Proceedings of the 20th Annual Conference on Learning Theory. 2007: 35-50.
[24] Scheffer T, Decomain C, Wrobel S. Active Hidden Markov Models for Information Extraction[C]//Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis. 2001: 309-318.
[25] Kim Y, Song K, Kim J W, et al. MMR-based Active Machine Learning for Bio Named Entity Recognition[C]//Proceedings of Human Language Technology and the North American Association for Computational Linguistics (HLT-NAACL). 2006: 69-72.
[26] Lowell D, Lipton Z C, Wallace B C. How Transferable are the Datasets Collected by Active Learners? [OL]. arXiv Preprint, arXiv: 1807.04801.
[27] FTDDataset_v1 [DS/OL]. [2020-05-21].https://nlp.stanford.edu/pubs/FTDDataset_v1.txt.