Data Analysis and Knowledge Discovery, 2020, Vol. 4, Issue 10: 134-143     https://doi.org/10.11925/infotech.2096-3467.2020.0281
Research Paper
Active Learning Strategies for Extracting Phrase-Level Topics from Scientific Literature
Tao Yue 1,2, Yu Li 1,3, Zhang Runjie 4
1 National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2 Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
3 State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China
4 School of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK
Abstract

[Objective] This paper explores effective methods of extracting information from scientific literature with the help of active learning strategies, aiming to address the scarcity of annotated corpora. [Methods] We combined three representative active learning strategies (MARGIN, NSE, MNLP) and a novel LWP strategy with a neural-network information extraction model (CNN-BiLSTM-CRF), and used the resulting framework to extract task- and method-related information from texts with far fewer annotations. [Results] Guided by active learning, selectively annotating only 10%-30% of the data matched the performance of the neural network trained on 100% annotated data, significantly reducing the labor cost of corpus construction. [Limitations] The scientific-literature dataset in the AI domain is small and noisy, which lowers the precision of the extraction model. [Conclusions] The proposed model significantly reduces the required scale of annotated corpora. Comparing the four active learning strategies shows that MNLP clearly outperforms the others, and its sentence-length normalization improves model stability; MARGIN performs well in the initial iterations and can identify low-value instances; LWP suits datasets where semantic labels account for a large proportion.
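The [Methods] workflow follows the standard pool-based active learning cycle: train on a small labeled pool, score the unlabeled pool with an uncertainty strategy, have annotators label the most valuable batch, and retrain until the annotation budget is spent. A minimal sketch of that cycle (the names `train_model`, `uncertainty`, and `oracle_label` are placeholders for illustration, not code from the paper):

```python
def active_learning_loop(labeled, unlabeled, train_model, uncertainty,
                         oracle_label, rounds=3, batch_size=2):
    """Generic pool-based active learning: repeatedly rank unlabeled
    instances by an uncertainty score (e.g. MARGIN or MNLP), send the
    most uncertain batch to a human annotator, and retrain."""
    model = train_model(labeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        # Lowest score = least confident = most valuable to annotate.
        unlabeled = sorted(unlabeled, key=lambda x: uncertainty(model, x))
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        labeled = labeled + [(x, oracle_label(x)) for x in batch]
        model = train_model(labeled)
    return model, labeled
```

With 10%-30% sampling, `rounds * batch_size` would be set to cover that fraction of the pool.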

Key words: Information Extraction; Active Learning; Neural Network
Received: 2020-04-03      Published online: 2020-11-09
CLC number: TP393
Funding: This work was supported by the National Natural Science Foundation of China Young Scientists Fund project "Semantic Relation Annotation and Evaluation of Geographic Entities in Chinese Web Texts" (Grant No. 41801320) and the Open Fund of the State Key Laboratory of Resources and Environmental Information System.
Corresponding author: Yu Li     E-mail: yul@mail.las.ac.cn
Cite this article:
Tao Yue, Yu Li, Zhang Runjie. Active Learning Strategies for Extracting Phrase-Level Topics from Scientific Literature[J]. Data Analysis and Knowledge Discovery, 2020, 4(10): 134-143.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2020.0281      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I10/134
Fig. 1  Neural network information extraction framework integrating active learning
Fig. 2  Framework of the learning engine
Strategy | Strengths | Weaknesses | Applicable conditions
MARGIN | Simple formula | Traverses all candidate sequences, so it is inefficient | When annotation difficulty varies across the corpus, it can find the instances that are hardest to annotate
NSE | Considers the N best labelings; low computational complexity | Ignores the effect of sequence length on a sequence's value | Sample selection over large datasets
MNLP | Normalizes by sequence length; moderately sensitive to long sequences | Performs well in early iterations but gains little afterwards | Finding samples that may contain special labels in label-sparse corpora
LWP | Dynamically adjusts the sensitivity of each label class | Tends to select sentences carrying more instance labels | Corpora with many label types and uneven label distribution
Table 1  Analysis of the active learning strategies
Fig. 3  Annotation example of literature data in the AI domain
Fig. 4  Comparison of F1 scores of the neural-network active information extraction model on scientific literature from different domains
Strategy | Dataset | Corpus sampling percentage
MNLP | AI scientific literature dataset | 30%
MNLP | FTD-FOCUS | 20%
MNLP | FTD-TECHNIQUE | 10%
MNLP | FTD-DOMAIN | 20%
NSE | AI scientific literature dataset | 60%
NSE | FTD-FOCUS | 30%
NSE | FTD-TECHNIQUE | 20%
NSE | FTD-DOMAIN | 30%
MARGIN | AI scientific literature dataset | 50%
MARGIN | FTD-FOCUS | 60%
MARGIN | FTD-TECHNIQUE | 30%
MARGIN | FTD-DOMAIN | 40%
LWP | AI scientific literature dataset | 40%
LWP | FTD-FOCUS | 80%
LWP | FTD-TECHNIQUE | 50%
LWP | FTD-DOMAIN | 60%
Table 2  Corpus sampling percentage at which the model reaches its best performance with active learning
Dataset | Extracted information | ACCPE
FTD | FOCUS | 79%
FTD | TECHNIQUE | 73%
FTD | DOMAIN | 82%
AI | task | 82%
AI | method | 79%
GIS | task | 85%
GIS | method | 81%
Table 3  ACCPE of information extraction in different domains
Dataset | Extracted information | Best F1
FTD | FOCUS | 55.33%
FTD | TECHNIQUE | 51.33%
FTD | DOMAIN | 57.73%
AI | task | 71.47%
AI | method | 70.03%
Table 4  Best F1 of information extraction in different domains