现代图书情报技术 (New Technology of Library and Information Service), 2016, Vol. 32, Issue 10: 91-97. https://doi.org/10.11925/infotech.1003-3513.2016.10.10
Application Paper
结合多种特征的CRF模型用于化学物质-疾病命名实体识别
隋明爽, 崔雷
中国医科大学医学信息学院 沈阳 110122
Extracting Chemical and Disease Named Entities with Multiple-Feature CRF Model
Sui Mingshuang, Cui Lei
School of Medical Informatics, China Medical University, Shenyang 110122, China
摘要 

【目的】建立结合多种特征的条件随机场模型, 探索从大型生物医学文本中同时自动提取化学物质和疾病实体的方法。【方法】结合命名实体识别特征, 包括词法特征、领域知识特征、词典匹配特征和无监督学习特征等, 比较不同特征对命名实体识别的效果, 并优化模型。【结果】CRF模型纳入词法特征、词典匹配特征、无监督学习特征和部分领域知识特征, 化学物质识别准确率97.33%、召回率80.76%、F值88.27%, 疾病实体识别准确率为84.20%、召回率为81.96%、F值为83.07%。【局限】同时识别化学物质和疾病实体可能存在互相干扰, 删除的部分领域特征可能含有有用信息。【结论】本研究可为生物医学命名实体识别的特征选择提供参考, 同时仍需优化特征以获得更好的识别效果。

关键词: 命名实体识别; 条件随机场; 文本挖掘; 无监督学习
Abstract

[Objective] This study builds a conditional random field (CRF) model that combines multiple feature types to automatically extract chemical and disease named entities from biomedical documents. [Methods] We compared the contribution of commonly used named entity recognition features, including lexical, domain knowledge, dictionary matching, and unsupervised learning features, and then optimized the model accordingly. [Results] The final CRF model used lexical features, dictionary matching features, unsupervised learning features, and a subset of the domain knowledge features. For chemical entities, precision, recall, and F-score were 97.33%, 80.76%, and 88.27%, respectively; for disease entities, they were 84.20%, 81.96%, and 83.07%. [Limitations] Recognizing chemical and disease entities simultaneously may cause the two tasks to interfere with each other, and the domain knowledge features that were removed may contain useful information. [Conclusions] This study offers a reference for feature selection in biomedical named entity recognition; the features still need further optimization to improve recognition performance.
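
The reported F-scores can be sanity-checked against the reported precision and recall. The sketch below assumes the F-score is the balanced F1 (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_score` is illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (harmonic mean); assumes F means F1, which is an assumption."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F %) as given in the abstract.
reported = {
    "chemical": (97.33, 80.76, 88.27),
    "disease": (84.20, 81.96, 83.07),
}

for entity, (p, r, f_reported) in reported.items():
    f = f_score(p, r)
    print(f"{entity}: recomputed F = {f:.2f}%, reported F = {f_reported:.2f}%")
```

Under that assumption the recomputed values agree with the reported ones to within about 0.01 percentage points, a gap of the size one would expect when precision and recall are rounded before publication.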

Key words: Named entity recognition; CRF; Text mining; Unsupervised learning
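
As a rough illustration of how the feature groups named in the abstract might be presented to a CRF, the sketch below builds one feature dictionary per token. It is not the authors' feature set: the lexicons, the cluster table, and every function and feature name are hypothetical stand-ins, and the domain knowledge features are omitted because the abstract does not describe them.

```python
import re
from typing import Dict, List, Set

# Hypothetical toy resources, invented for illustration only.
CHEMICAL_LEXICON: Set[str] = {"aspirin", "lithium"}
DISEASE_LEXICON: Set[str] = {"hypertension", "nephrotoxicity"}
WORD_CLUSTER: Dict[str, str] = {"aspirin": "0110", "hypertension": "1011"}


def word_shape(token: str) -> str:
    """Map upper/lower/digit/other characters to A/a/0/- (a lexical feature)."""
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    shape = re.sub(r"[0-9]", "0", shape)
    return re.sub(r"[^Aa0]", "-", shape)


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build the feature dictionary for the token at position i."""
    token = tokens[i]
    lower = token.lower()
    return {
        # Lexical features
        "lower": lower,
        "shape": word_shape(token),
        "prefix3": lower[:3],
        "suffix3": lower[-3:],
        "has_digit": any(c.isdigit() for c in token),
        # Dictionary matching features
        "in_chemical_lexicon": lower in CHEMICAL_LEXICON,
        "in_disease_lexicon": lower in DISEASE_LEXICON,
        # Unsupervised learning feature: a word-cluster ID (assumed form)
        "cluster": WORD_CLUSTER.get(lower, "UNK"),
        # Context window of one token on each side
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def sentence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """Feature dictionaries for every token in a sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


sentence = ["Aspirin", "may", "reduce", "hypertension", "."]
for tok, feats in zip(sentence, sentence_features(sentence)):
    print(tok, feats)
```

Keeping each feature group in its own block of keys makes it straightforward to switch a group on or off and compare results, which is the kind of feature comparison the abstract describes.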
Received: 2016-06-24      Published: 2016-11-23
Cite this article:
隋明爽,崔雷. 结合多种特征的CRF模型用于化学物质-疾病命名实体识别[J]. 现代图书情报技术, 2016, 32(10): 91-97.
Sui Mingshuang, Cui Lei. Extracting Chemical and Disease Named Entities with Multiple-Feature CRF Model[J]. New Technology of Library and Information Service, 2016, 32(10): 91-97.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.10.10      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2016/V32/I10/91