Extracting Fine-grained Knowledge Units from Texts with Deep Learning
Li Yu1,3, Li Qian1,2, Changlei Fu1, Huaming Zhao1
1National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2Department of Library, Information and Archives Management, University of Chinese Academy of Sciences, Beijing 100190, China
3State Key Laboratory of Resources and Environmental Information System, Beijing 100101, China
[Objective] This paper proposes a deep learning method, built on a modified bootstrapping procedure, for extracting fine-grained knowledge units from texts. [Methods] First, we built a lexicon for each type of knowledge unit using a search engine and keywords from Elsevier. Second, we created a large annotated corpus with the bootstrapping method. Third, we controlled annotation quality with estimation models for patterns and knowledge units. Finally, we trained the proposed LSTM-CRF model on the annotated corpus and used it to extract new knowledge units from texts. [Results] We extracted four types of knowledge units (study scope, research method, experimental data, and evaluation criteria with their values) from 17,756 ACL papers. The average precision, assessed manually, was 91%. [Limitations] The model parameters were pre-defined and tuned manually. More research is needed to evaluate the method on texts from other domains. [Conclusions] The proposed model effectively addresses the issue of semantic drift and extracts knowledge units precisely, making it an effective solution for large-scale data acquisition in intelligence analysis.
Li Yu, Li Qian, Changlei Fu, Huaming Zhao. Extracting Fine-grained Knowledge Units from Texts with Deep Learning[J]. Data Analysis and Knowledge Discovery, 2019, 3(1): 38-45.