Active Learning Strategies for Extracting Phrase-Level Topics from Scientific Literature
Tao Yue1,2, Yu Li1,3, Zhang Runjie4
1 National Science Library, Chinese Academy of Sciences, Beijing 100190, China; 2 Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China; 3 State Key Laboratory of Resources and Environmental Information System, Beijing 100101, China; 4 Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK
[Objective] This paper explores methods of extracting information from scientific literature with the help of active learning strategies, aiming to address the lack of annotated corpora. [Methods] We constructed our model from three representative active learning strategies (MARGIN, NSE, MNLP), one novel LWP strategy, and a CNN-BiLSTM-CRF neural network. We then extracted task- and method-related information from texts with far fewer annotations. [Results] We evaluated the model on scientific articles with only 10%–30% of the texts selectively annotated. It yielded the same results as models trained on fully annotated texts, significantly reducing the labor cost of corpus construction. [Limitations] The number of scientific articles in our sample corpus was small, which led to relatively low precision. [Conclusions] The proposed model significantly reduces reliance on large annotated corpora. Among the existing active learning strategies, MNLP yields the best results and normalizes sentence length to improve the model's stability; MARGIN performs well in the initial iterations at identifying low-value instances; and LWP is suitable for datasets with more semantic labels.
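MARGIN and MNLP, named in the abstract, are standard uncertainty-sampling criteria for choosing which sentences to send for human annotation. A minimal sketch (toy label distributions and function names are our own, not from the paper) of how the two criteria rank candidate sentences:

```python
import math

def margin_score(token_probs):
    """MARGIN strategy: average gap between the two most probable labels
    per token. A smaller margin means the model is less certain, so the
    sentence is more valuable to annotate."""
    margins = []
    for dist in token_probs:
        top2 = sorted(dist, reverse=True)[:2]
        margins.append(top2[0] - top2[1])
    return sum(margins) / len(margins)

def mnlp_score(token_probs):
    """MNLP (Maximum Normalized Log-Probability): mean log-probability of
    the locally best label per token. Dividing by sentence length keeps
    long sentences from looking artificially uncertain; a lower score
    means higher uncertainty."""
    return sum(math.log(max(dist)) for dist in token_probs) / len(token_probs)

# Toy per-token label distributions (3 labels) for two candidate sentences.
confident = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]
uncertain = [[0.4, 0.35, 0.25], [0.5, 0.3, 0.2]]

# Both criteria score the uncertain sentence lower, so an active learner
# would select it for annotation first.
assert margin_score(uncertain) < margin_score(confident)
assert mnlp_score(uncertain) < mnlp_score(confident)
```

In a real pipeline these distributions would come from the CNN-BiLSTM-CRF model's per-token posteriors; the selection loop annotates the lowest-scoring sentences, retrains, and repeats.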
Tao Yue,Yu Li,Zhang Runjie. Active Learning Strategies for Extracting Phrase-Level Topics from Scientific Literature. Data Analysis and Knowledge Discovery, 2020, 4(10): 134-143.
[1] Matos P F, Lombardi L O, Pardo T A S, et al. An Environment for Data Analysis in Biomedical Domain: Information Extraction for Decision Support Systems[C]//Proceedings of the 23rd International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems - Volume Part I. 2010: 306-316.
[2] Santos C N, Xiang B, Zhou B W. Classifying Relations by Ranking with Convolutional Neural Networks[OL]. arXiv Preprint, arXiv: 1504.06580.
[3] Hu Z T, Ma X Z, Liu Z Z, et al. Harnessing Deep Neural Networks with Logic Rules[J]. Computing Research Repository, 2016, 16(3): 2410-2420.
[4] Lample G, Ballesteros M, Subramanian S, et al. Neural Architectures for Named Entity Recognition[C]//Proceedings of NAACL-HLT 2016. 2016: 260-270.
[5] Jang Y, Choi H, Deng F, et al. Evaluation of Deep Learning Models for Information Extraction from EMF-Related Literature[C]//Proceedings of the Conference on Research in Adaptive and Convergent Systems. 2019: 113-116.
[6] Gupta S, Manning C D. Analyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers[C]//Proceedings of the 5th International Joint Conference on Natural Language Processing. 2011: 1-9.
[7] Wang Yuefen, Cao Jiajun, Yu Houqiang, et al. Identification and Analysis of Research Fronts in Artificial Intelligence: A Perspective Based on Global Evolution Study of the Domain[J]. Information Studies: Theory & Application, 2019, 42(9): 1-7.
[8] Zhang Zhixiong, Liu Huan, Ding Liangping, et al. Identifying Moves of Research Abstracts with Deep Learning Methods[J]. Data Analysis and Knowledge Discovery, 2019, 3(12): 1-9.
[9] Ding Liangping, Zhang Zhixiong, Liu Huan. Factors Affecting Rhetorical Move Recognition with SVM Model[J]. Data Analysis and Knowledge Discovery, 2019, 3(11): 16-23.
[10] Kulkarni S R, Mitter S K, Tsitsiklis J N, et al. Active Learning Using Arbitrary Binary Valued Queries[J]. Machine Learning, 1993, 11: 23-35. DOI: 10.1023/A:1022627018023.
[11] Culotta A, McCallum A. Reducing Labeling Effort for Structured Prediction Tasks[C]//Proceedings of the 20th National Conference on Artificial Intelligence. 2005: 746-751.
[12] Sutton C, McCallum A. An Introduction to Conditional Random Fields for Relational Learning[J]. Foundations and Trends in Machine Learning, 2012, 4(4): 267-373. DOI: 10.1561/2200000013.
[13] Scheffer T, Decomain C, Wrobel S. Active Hidden Markov Models for Information Extraction[C]//Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis. 2001: 309-318.
[14] Deng Y, Bao F, Deng X S, et al. Deep and Structured Robust Information Theoretic Learning for Image Analysis[J]. IEEE Transactions on Image Processing, 2016, 25(9): 4209-4221. DOI: 10.1109/TIP.2016.2588330. PMID: 27392359.
[15] Vijayanarasimhan S, Grauman K. Active Frame Selection for Label Propagation in Videos[C]//Proceedings of the 12th European Conference on Computer Vision - Volume Part V. 2012: 496-509.
[16] Deng Y, Dai Q H, Liu R S, et al. Low-Rank Structure Learning via Nonconvex Heuristic Recovery[J]. IEEE Transactions on Neural Networks and Learning Systems, 2013, 24(3): 383-396. DOI: 10.1109/TNNLS.2012.2235082. PMID: 24808312.
[17] Deng Y, Chen K W, Shen Y L, et al. Adversarial Active Learning for Sequences Labeling and Generation[C]//Proceedings of the 27th International Joint Conference on Artificial Intelligence. 2018: 4012-4018.
[18] Ma X Z, Hovy E. End-to-End Sequence Labeling via Bi-directional LSTM-CNNs-CRF[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 2016: 1064-1074.
[19] Dyer C, Ballesteros M, Ling W, et al. Transition-Based Dependency Parsing with Stack Long Short-Term Memory[C]//Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. 2015: 334-343.
[20] Ratinov L, Roth D. Design Challenges and Misconceptions in Named Entity Recognition[C]//Proceedings of the 13th Conference on Computational Natural Language Learning. 2009: 147-155.
[21] Shen Y Y, Yun H K, Lipton Z, et al. Deep Active Learning for Named Entity Recognition[C]//Proceedings of the 2nd Workshop on Representation Learning for NLP. 2017: 252-256.
[22] LeCun Y, Bengio Y. Convolutional Networks for Images, Speech, and Time Series[A]//The Handbook of Brain Theory and Neural Networks[M]. MIT Press, 1998: 255-258.
[23] Balcan M F, Broder A, Zhang T. Margin Based Active Learning[C]//Proceedings of the 20th Annual Conference on Learning Theory. 2007: 35-50.
[24] Scheffer T, Decomain C, Wrobel S. Active Hidden Markov Models for Information Extraction[C]//Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis. 2001: 309-318.
[25] Kim Y, Song K, Kim J W, et al. MMR-based Active Machine Learning for Bio Named Entity Recognition[C]//Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL). 2006: 69-72.
[26] Lowell D, Lipton Z C, Wallace B C. How Transferable are the Datasets Collected by Active Learners?[OL]. arXiv Preprint, arXiv: 1807.04801.