Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (1): 38-45     https://doi.org/10.11925/infotech.2096-3467.2018.1352
Extracting Fine-grained Knowledge Units from Texts with Deep Learning*
Li Yu1,3, Li Qian1,2, Changlei Fu1, Huaming Zhao1
1National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2Department of Library, Information and Archives Management, University of Chinese Academy of Sciences, Beijing 100190, China
3State Key Laboratory of Resources and Environmental Information System, Beijing 100101, China
Abstract

[Objective] This paper improves the bootstrapping method and builds a deep learning model to extract multiple types of fine-grained knowledge units from texts. [Methods] First, we built a lexicon for each type of knowledge unit using a search engine and Elsevier keywords. Second, we automatically constructed a large annotated corpus with the bootstrapping technique, controlling annotation quality with a knowledge-unit scoring model and a pattern scoring model. Finally, we trained an LSTM-CRF model on the corpus annotated with multiple types of knowledge units and used it to extract new knowledge units from texts. [Results] We extracted four types of knowledge units (research scope, research method, experimental data, and evaluation metrics with their values) from the abstracts of 17,756 ACL papers. The average precision, assessed manually, was 91%. [Limitations] Model parameters must be preset and tuned manually, and the method has not been validated on texts from other domains. [Conclusions] Introducing scoring models for knowledge units and patterns effectively mitigates semantic drift. Extracting knowledge units with a deep learning model is fast and precise, providing an efficient and reliable means of data acquisition for the intelligent analysis of intelligence big data.
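As a rough illustration of how scoring both patterns and candidate knowledge units can limit semantic drift during bootstrapping, the following Python sketch runs a toy bootstrapping loop. The scoring formulas, thresholds, function names, and the toy corpus are all assumptions made for illustration; the abstract does not specify the paper's actual scoring models.

```python
from collections import defaultdict


def extract(sentences, width=2):
    """Map each left-context pattern (a tuple of `width` tokens)
    to the set of tokens that directly follow it in the corpus."""
    hits = defaultdict(set)
    for sent in sentences:
        toks = sent.lower().split()
        for i in range(width, len(toks)):
            hits[tuple(toks[i - width:i])].add(toks[i])
    return hits


def bootstrap(sentences, seeds, rounds=3, min_pattern=0.5, min_unit=0.5):
    """Grow a lexicon from seed terms (hypothetical scoring, not the paper's)."""
    known = set(seeds)
    hits = extract(sentences)
    for _ in range(rounds):
        # Pattern score: share of a pattern's extractions already in the lexicon.
        pattern_score = {p: len(ts & known) / len(ts) for p, ts in hits.items()}
        good = {p for p, s in pattern_score.items() if s >= min_pattern}
        # Unit score: mean score of every pattern that extracts the candidate,
        # so a term that also appears in unreliable contexts is penalized.
        by_unit = defaultdict(list)
        for p, ts in hits.items():
            for t in ts - known:
                by_unit[t].append(pattern_score[p])
        new = {
            t for t, scores in by_unit.items()
            if sum(scores) / len(scores) >= min_unit
            and any(t in hits[p] for p in good)
        }
        if not new:
            break
        known |= new
    return known


# Toy corpus (assumed): "it" follows the reliable pattern "we use" once,
# but also follows an unrelated pattern, which lowers its unit score.
sentences = [
    "we use svm to classify documents",
    "we use crf to label tokens",
    "we use lstm to encode sentences",
    "we use it to save time",
    "results show it is fast",
]
lexicon = bootstrap(sentences, seeds={"svm", "crf"})
print(sorted(lexicon))
```

In this toy run the candidate "lstm" is accepted because it is extracted only by a pattern that mostly yields known terms, while "it" is rejected because its average score is dragged down by an unrelated pattern. Dropping the unit-score check would let "it" into the lexicon, which is the kind of drift the scoring step is meant to prevent.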

Keywords: Knowledge Unit Extraction; Named Entity Recognition; Deep Learning; Bootstrapping; LSTM-CRF
Received: 2018-12-02      Published: 2019-03-04
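Funding: *This work is an outcome of the National Natural Science Foundation of China project "Annotation and Evaluation of Semantic Relations between Geographic Entities in Chinese Web Texts" (Grant No. 41801320), the National Social Science Fund of China project "Deep Integration and Disclosure of Resources Based on Open Access Academic Journals" (Grant No. 16BTQ025), and the Youth Innovation Team project of the National Science Library, Chinese Academy of Sciences, "Research Fingerprint Identification Based on Machine Learning" (Grant No. 馆1724).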
Cite this article:
Li Yu, Li Qian, Changlei Fu, Huaming Zhao. Extracting Fine-grained Knowledge Units from Texts with Deep Learning[J]. Data Analysis and Knowledge Discovery, 2019, 3(1): 38-45.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.1352      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2019/V3/I1/38