New Technology of Library and Information Service  2016, Vol. 32 Issue (10): 91-97    DOI: 10.11925/infotech.1003-3513.2016.10.10
Original Article
Extracting Chemical and Disease Named Entities with Multiple-Feature CRF Model
Sui Mingshuang, Cui Lei
School of Medical Informatics, China Medical University, Shenyang 110122, China

[Objective] This study builds a conditional random field (CRF) model with multiple features to automatically extract chemical and disease named entities from biomedical documents. [Methods] We compared the performance of commonly used named entity recognition features, including lexical features, domain knowledge features, dictionary matching features, and unsupervised learning features, and then optimized the model accordingly. [Results] The final CRF model uses lexical features, dictionary matching features, unsupervised learning features, and a subset of the domain knowledge features. For chemical entity recognition, precision, recall, and F-score were 97.33%, 80.76%, and 88.27%, respectively; for disease entities, they were 84.20%, 81.96%, and 83.07%. [Limitations] Chemical and disease entities may interfere with each other when identified simultaneously, and the discarded domain knowledge features may contain valuable information. [Conclusions] This study proposes a new method for identifying biomedical named entities, which could be further improved.
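
As a quick consistency check on the reported results, the F-score can be recomputed from precision and recall. The sketch below assumes the abstract's F-score is the balanced F1 measure (the harmonic mean of precision and recall); the abstract does not state this explicitly, and the function name `f_score` and the `reported` table are illustrative only. Under that assumption, the recomputed values agree with the reported ones to within rounding of the published precision and recall.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score from precision and recall given on the same scale.

    With beta = 1 this is the harmonic mean 2PR / (P + R).
    Assumption: the abstract's "F-score" is this balanced F1 measure.
    """
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # convention when both precision and recall are zero
    return (1 + beta ** 2) * precision * recall / denominator


# Precision and recall (in percent) as reported in the abstract.
reported = {
    "chemical": (97.33, 80.76),
    "disease": (84.20, 81.96),
}

for entity_type, (p, r) in reported.items():
    print(f"{entity_type}: F1 = {f_score(p, r):.2f}%")
```

The gap between precision and recall for chemical entities (97.33% versus 80.76%) pulls the harmonic mean toward the lower value, which is why the chemical F-score sits well below its precision.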

Key words: Named entity recognition; CRF; Text mining; Unsupervised learning
Received: 24 June 2016      Published: 23 November 2016

Cite this article:

Sui Mingshuang, Cui Lei. Extracting Chemical and Disease Named Entities with Multiple-Feature CRF Model. New Technology of Library and Information Service, 2016, 32(10): 91-97.

