[Objective] This study builds a conditional random field (CRF) model that combines multiple features to automatically extract chemical and disease named entities from biomedical documents. [Methods] We compared the performance of commonly used named entity recognition features, including lexical features, domain knowledge features, dictionary matching features, and unsupervised learning features, and then optimized the model based on the comparison. [Results] The final CRF model uses lexical features, dictionary matching features, unsupervised learning features, and a subset of the domain knowledge features. For chemical entity recognition, the precision, recall, and F-score were 97.33%, 80.76%, and 88.27%, respectively. For disease entities, they were 84.20%, 81.96%, and 83.07%, respectively. [Limitations] Chemical and disease entities may interfere with each other when they are recognized simultaneously, and the domain knowledge features that were removed may contain valuable information. [Conclusions] This study proposes a method for recognizing biomedical named entities that leaves room for further improvement.
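The reported F-scores can be checked against the reported precision and recall. The sketch below assumes the F-score is the balanced F1 (the harmonic mean of precision and recall), which the abstract does not state explicitly. The function name `f_score` is illustrative and not taken from the paper.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F1: harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values as reported in the abstract (percent).
chemical_f = f_score(97.33, 80.76)
disease_f = f_score(84.20, 81.96)

print(f"chemical F1: {chemical_f:.2f}")
print(f"disease F1:  {disease_f:.2f}")
```

Under this assumption the chemical value reproduces the reported 88.27%. The disease value comes out within 0.01 of the reported 83.07%, a gap that would be consistent with precision and recall having been rounded before publication.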
隋明爽,崔雷. 结合多种特征的CRF模型用于化学物质-疾病命名实体识别[J]. 现代图书情报技术, 2016, 32(10): 91-97.
Sui Mingshuang, Cui Lei. Extracting Chemical and Disease Named Entities with Multiple-Feature CRF Model[J]. New Technology of Library and Information Service, 2016, 32(10): 91-97.