New Technology of Library and Information Service  2016, Vol. 32 Issue (10): 91-97    DOI: 10.11925/infotech.1003-3513.2016.10.10
Original Article
Extracting Chemical and Disease Named Entities with Multiple-Feature CRF Model
Sui Mingshuang, Cui Lei
School of Medical Informatics, China Medical University, Shenyang 110122, China

[Objective] This study builds a multiple-feature CRF model that automatically extracts chemical and disease named entities from biomedical documents. [Methods] We compared the performance of commonly used named entity recognition features, including lexical features, domain knowledge features, dictionary matching features, and unsupervised learning features, and then optimized the model accordingly. [Results] The final CRF model combines lexical features, dictionary matching features, unsupervised learning features, and a subset of the domain knowledge features. For chemical entity identification, the precision, recall, and F-score were 97.33%, 80.76%, and 88.27%, respectively. For disease entities, they were 84.20%, 81.96%, and 83.07%, respectively. [Limitations] Chemical and disease entities may interfere with each other when identified simultaneously, and the domain knowledge features that were removed may contain valuable information. [Conclusions] This study proposes a new method for identifying biomedical named entities, which could be further improved.
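The F-scores reported above can be checked against the stated precision and recall. The sketch below assumes the F-score is the balanced F1 (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_score` is illustrative only. Recomputed from the rounded percentages, the values agree with the reported figures to within 0.01 percentage points, a gap consistent with rounding of the inputs.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (harmonic mean) of precision and recall.

    Inputs and output are in percent. Assumes F1 weighting, which is an
    assumption here: the abstract does not name the F-score variant.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, in percent.
reported = {
    "chemical": (97.33, 80.76),
    "disease": (84.20, 81.96),
}

for entity_type, (p, r) in reported.items():
    print(f"{entity_type}: F = {f_score(p, r):.2f}%")
```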

Key words: Named entity recognition; CRF; Text mining; Unsupervised learning
Received: 24 June 2016      Published: 23 November 2016

Cite this article:

Sui Mingshuang, Cui Lei. Extracting Chemical and Disease Named Entities with Multiple-Feature CRF Model. New Technology of Library and Information Service, 2016, 32(10): 91-97.

