Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (5): 83-94    DOI: 10.11925/infotech.2096-3467.2020.1211
Normalizing Chinese Disease Names with Multi-feature Fusion
Han Pu1,2, Zhang Zhanpeng1, Zhang Mingtao1, Gu Liang1
1School of Management, Nanjing University of Posts & Telecommunications, Nanjing 210023, China;
2Jiangsu Provincial Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
Abstract  

[Objective] This paper proposes a normalization model for Chinese disease names based on multi-feature fusion, aiming to address the problem of a single disease having multiple alternative names in online health communities. [Methods] First, we constructed a normalization dataset of Chinese disease names used in online health communities. Second, we conducted experiments on Chinese and English data with the LSTM, GRU and CNN models. Third, we generated external semantic feature vectors with Word2vec and GloVe. Finally, we developed MFCF-CNN, a normalization model for Chinese disease names based on multi-feature fusion and the self-attention mechanism. [Results] We evaluated the proposed model with the Accuracy@10 metric. The MFCF-CNN model reached 85.48%, which is 8.84 percentage points higher than the basic CNN model. Our model made better use of global and local semantic features. [Limitations] The experimental dataset needs to be expanded. [Conclusions] The proposed model advances the normalization of Chinese disease names, which benefits medical knowledge graph construction and natural language understanding in Chinese.

Key words: Disease Name Normalization; Supervised Learning; Convolutional Neural Network; Self-attention Mechanism
Received: 04 December 2020      Published: 27 May 2021
ZTFLH:  G250  
Fund: This work is supported by the National Social Science Fund of China (17CTQ022) and the Jiangsu Graduate Research and Innovation Program Fund Project (KYCX20_0844).
Corresponding Authors: Han Pu     E-mail: hanpu@njupt.edu.cn

Cite this article:

Han Pu, Zhang Zhanpeng, Zhang Mingtao, Gu Liang. Normalizing Chinese Disease Names with Multi-feature Fusion. Data Analysis and Knowledge Discovery, 2021, 5(5): 83-94.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.1211     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I5/83

Convolutional Neural Network Model
Experimental Flowchart
Disease name | Word-level text | Character-level text
水痘 (chickenpox) | 背部 腹部 水痘 感觉 瘙痒 水泡患者 局部 皮疹 轻微 疼痛 皮炎平 效果带状疱疹 疼痛感 涂抹 阿昔洛韦 软膏 配合 口服 胸腺肽 肠溶片 增强 免疫力 免疫 功能 低下 | 背 部 胸 腹 现 水 痘 感 觉 痒 瘙 泡 病 患 局 皮 疹 轻 微 疼 痛 抹 炎 平 效 果 状 疱 涂 昔 洛 韦 软 膏 配 合 口 服 腺 肽 肠 溶 片 增 强 免 疫 力 主 功 低
风湿热 (rheumatic fever) | 湿热 出汗 畏寒 怕冷 特别 口腔溃疡 嗓子 痛发于 舌尖 唇部 牙龈 胀痛 口腔 异味 月经 病史 服药 过敏史 饮食 偏辣 高血压 高血糖 高血脂 冠心病 高尿酸 血症舌苔 | 湿 热 汗 畏 寒 冷 特 容 易 口 腔 溃 疡 嗓 子 痛 舌 尖 唇 部 牙 龈 胀 异 味 时 月 正 病 史 服 敏 饮 食 偏 辣 高 血 压 糖 脂 冠 心 尿 酸 症 苔
关节炎 (arthritis) | 血清 骨钙素 测定 胶原蛋白 序列 维生素 白介素 肿瘤 坏死 因子 日去 好坏 泼尼松 拍片 骨折 随访 减药 关系 | 血 清 骨 钙 素 测 B 胶 原 蛋 序 列 羟 维 生 D 介 肿 瘤 坏 死 子 日 泼 尼 松 龙 片 吃 拍 骨 折 样 访 减 药 关 系
Examples of Chinese Disease Dataset
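The difference between the word-level and character-level columns can be illustrated with a minimal Python sketch. It assumes the input is already word-segmented and space-separated, as in the table. The helper name `words_to_chars` and its `unique` flag are hypothetical. This page does not give the authors' exact preprocessing; the character column above appears to list each distinct character once and to draw on the raw text, so this sketch will not reproduce it exactly.

```python
def words_to_chars(word_level_text, unique=False):
    """Turn space-separated word tokens into space-separated character tokens.

    unique=True keeps only the first occurrence of each character
    (an assumption about how a character vocabulary might be listed).
    """
    chars = [ch for word in word_level_text.split() for ch in word]
    if unique:
        seen = set()
        deduped = []
        for ch in chars:
            if ch not in seen:
                seen.add(ch)
                deduped.append(ch)
        chars = deduped
    return " ".join(chars)


sample = "背部 腹部 水痘 感觉 瘙痒"
print(words_to_chars(sample))               # 背 部 腹 部 水 痘 感 觉 瘙 痒
print(words_to_chars(sample, unique=True))  # 背 部 腹 水 痘 感 觉 瘙 痒
```

Character-level input avoids dependence on a word segmenter, which may matter for informal text from online health communities.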
Multi-feature Fusion Model MFCF-CNN Based on Self-attention Mechanism
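For readers unfamiliar with the mechanism named in this figure title, the following is a minimal sketch of scaled dot-product self-attention followed by feature concatenation. It is not the authors' implementation: the identity query/key/value projections, the concatenation-based `fuse` helper, and all vector values are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections.

    Each output vector is a weighted average of all input vectors,
    so every token can draw on global context.
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        )
    return outputs


def fuse(primary, external):
    """Concatenate two per-token feature sequences (an assumed fusion rule)."""
    if len(primary) != len(external):
        raise ValueError("feature sequences must have the same length")
    return [p + e for p, e in zip(primary, external)]


# Toy 2-dimensional token vectors and 1-dimensional external features.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens)
fused = fuse(attended, [[0.1], [0.2], [0.3]])
print(len(fused), len(fused[0]))
```

In a full model the fused sequence would feed a convolutional layer, which captures local patterns, while the attention step supplies global context.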
Dataset Division and Model Training Process
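External semantic feature vector | Domain | Corpus source
Wiki-WCv | General domain | Wikipedia (2020 edition)
EMR-WCv | Clinical medicine | CCKS2017 electronic medical records
MA-WCv | Biomedicine | Wanfang Med Online (万方医学网), medical literature abstracts
OHC-WCv | Online health care | 好问康, 求医问药网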
External Semantic Features and Source of Corpus
Disease name | Disease description
Arthritis of knee | arthritic knees
Lightheadedness | light headed
Myalgia | Muscle aches & pains
Taste sense altered | taste perversion
Foot pain | pain on the sole of my feet
Myositis | muscle inflammation
Severe pain | severe pain close to my the crotch area
Myalgia | soreness of muscles
Examples of English Disease Dataset
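Parameter | CNN | LSTM | GRU
Input sentence vector dimension | 100 | 100 | 100
Number of convolution kernels | 4 | / | /
Neurons | 128 | 128 | 128
Number of input samples | 20 | 20 | 20
Number of iterations | 10 | 20 | 20
Dropout | 0.5
Softmax layer size | Number of normalized disease names
Attention mechanism | Self-attention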
Experimental Parameter Settings
Model | Accuracy@1 | Accuracy@5 | Accuracy@10
CNN-WRv-ADR | 18.71% | 47.09% | 54.19%
LSTM-WRv-ADR | 22.58% | 45.81% | 68.39%
GRU-WRv-ADR | 20.65% | 47.10% | 65.81%
CNN-WRv-ASK | 61.19% | 78.10% | 80.12%
LSTM-WRv-ASK | 65.12% | 79.76% | 84.76%
GRU-WRv-ASK | 66.79% | 79.29% | 85.12%
CNN-WRv-CDND | 60.98% | 74.89% | 76.64%
LSTM-WRv-CDND | 59.34% | 72.43% | 75.21%
GRU-WRv-CDND | 58.97% | 71.63% | 74.28%
CNN-CRv-CDND | 70.06% | 83.09% | 84.48%
Accuracy of Chinese and English Disease Name Normalization
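The Accuracy@k columns count a prediction as correct when the gold disease name appears among the model's top k ranked candidates. A minimal sketch of that computation follows; the function name `accuracy_at_k` and the toy rankings are hypothetical and are not taken from the paper's data.

```python
def accuracy_at_k(ranked_predictions, gold_labels, k):
    """Fraction of samples whose gold label is among the top-k candidates.

    ranked_predictions: one list per sample, best candidate first.
    gold_labels: the correct normalized name for each sample.
    """
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("predictions and labels must have the same length")
    if not gold_labels:
        return 0.0
    hits = sum(
        1
        for ranked, gold in zip(ranked_predictions, gold_labels)
        if gold in ranked[:k]
    )
    return hits / len(gold_labels)


# Toy example with three samples (illustrative values only).
ranked = [
    ["Myalgia", "Myositis", "Foot pain"],
    ["Severe pain", "Foot pain", "Myalgia"],
    ["Lightheadedness", "Myalgia", "Myositis"],
]
gold = ["Myositis", "Foot pain", "Taste sense altered"]

for k in (1, 2, 3):
    print(f"Accuracy@{k} = {accuracy_at_k(ranked, gold, k):.2%}")
```

Because the top-k window only grows with k, Accuracy@k can never decrease as k increases, which matches the left-to-right pattern in every row of the table.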
Semantic feature | Accuracy@1 | Accuracy@5 | Accuracy@10
Wiki-WCv | 70.30% | 83.40% | 84.99%
EMR-WCv | 69.25% | 82.27% | 83.75%
MA-WCv | 70.36% | 83.41% | 84.92%
OHC-WCv | 70.21% | 83.52% | 84.90%
Accuracy of the CNN-WCv Model After Introducing External Semantic Feature Vectors
Model | Accuracy@1 | Accuracy@5 | Accuracy@10
CNN-WCv | 70.21% | 83.52% | 84.90%
CNN-GCv | 69.62% | 83.21% | 84.51%
MFCF-CNN-AWCv | 70.64% | 83.87% | 85.28%
MFCF-CNN-AGCv | 70.22% | 83.71% | 85.06%
MFCF-CNN-AWGCv | 71.05% | 83.95% | 85.48%
Accuracy of Chinese Disease Name Normalization Based on MFCF
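The improvement figure quoted in the abstract can be checked against the tables. The short sketch below assumes that the "basic CNN model" refers to the CNN-WRv-CDND row and that the best MFCF variant is MFCF-CNN-AWGCv; both pairings are inferred from the reported numbers rather than stated on this page.

```python
# Reported accuracies in percent, copied from the tables above.
baseline_cnn = {"Accuracy@1": 60.98, "Accuracy@5": 74.89, "Accuracy@10": 76.64}
mfcf_awgcv = {"Accuracy@1": 71.05, "Accuracy@5": 83.95, "Accuracy@10": 85.48}

# Absolute gain in percentage points for each metric.
gains = {
    metric: round(mfcf_awgcv[metric] - baseline_cnn[metric], 2)
    for metric in baseline_cnn
}

for metric, gain in gains.items():
    print(f"{metric}: +{gain:.2f} percentage points")
```

The Accuracy@10 gain works out to 8.84 percentage points, which agrees with the abstract. It is an absolute difference between two percentages, not a relative increase.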
Comparative Analysis of Experimental Results