Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (5): 83-94     https://doi.org/10.11925/infotech.2096-3467.2020.1211
Research Paper
Normalizing Chinese Disease Names with Multi-feature Fusion
Han Pu1,2, Zhang Zhanpeng1, Zhang Mingtao1, Gu Liang1
1School of Management, Nanjing University of Posts & Telecommunications, Nanjing 210023, China;
2Jiangsu Provincial Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
Abstract

[Objective] Disease names in online health communities are often expressed through many alternative mentions. To address this, the paper proposes a Chinese disease name normalization model based on multi-feature fusion. [Methods] First, we constructed a Chinese disease name normalization dataset from online health communities. Second, we ran comparative experiments on Chinese and English data with LSTM, GRU and CNN models. Third, we generated external semantic feature vectors with Word2Vec and GloVe and validated them with the CNN model. Finally, building on the self-attention mechanism, we proposed the multi-feature fusion normalization model MFCF-CNN, which makes better use of global and local semantic features. [Results] On the Chinese dataset, the MFCF-CNN model reached 85.48% on the Accuracy@10 metric, 8.84 percentage points higher than the basic CNN model. [Limitations] The constructed dataset is small, and more data is needed to demonstrate the model's generalization ability. [Conclusions] The work advances research on Chinese disease name normalization and supports Chinese medical knowledge graph construction and natural language understanding.

Key words: Disease Name Normalization; Supervised Learning; Convolutional Neural Network; Self-attention Mechanism
Received: 2020-12-04      Published: 2021-05-27
Chinese Library Classification (ZTFLH):  G250
Funding: *This work is one of the research outcomes of the National Social Science Fund of China project (17CTQ022) and the Jiangsu Postgraduate Research and Innovation Program project (KYCX20_0844).
Corresponding author: Han Pu     E-mail: hanpu@njupt.edu.cn
Cite this article:
Han Pu, Zhang Zhanpeng, Zhang Mingtao, Gu Liang. Normalizing Chinese Disease Names with Multi-feature Fusion[J]. Data Analysis and Knowledge Discovery, 2021, 5(5): 83-94.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2020.1211      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I5/83
Fig.1  Convolutional neural network model
Fig.2  Experimental workflow
Disease name | Word-level text | Character-level text
水痘 (chickenpox) | 背部 腹部 水痘 感觉 瘙痒 水泡患者 局部 皮疹 轻微 疼痛 皮炎平 效果带状疱疹 疼痛感 涂抹 阿昔洛韦 软膏 配合 口服 胸腺肽 肠溶片 增强 免疫力 免疫 功能 低下 | 背 部 胸 腹 现 水 痘 感 觉 痒 瘙 泡 病 患 局 皮 疹 轻 微 疼 痛 抹 炎 平 效 果 状 疱 涂 昔 洛 韦 软 膏 配 合 口 服 腺 肽 肠 溶 片 增 强 免 疫 力 主 功 低
风湿热 (rheumatic fever) | 湿热 出汗 畏寒 怕冷 特别 口腔溃疡 嗓子 痛发于 舌尖 唇部 牙龈 胀痛 口腔 异味 月经 病史 服药 过敏史 饮食 偏辣 高血压 高血糖 高血脂 冠心病 高尿酸 血症舌苔 | 湿 热 汗 畏 寒 冷 特 容 易 口 腔 溃 疡 嗓 子 痛 舌 尖 唇 部 牙 龈 胀 异 味 时 月 正 病 史 服 敏 饮 食 偏 辣 高 血 压 糖 脂 冠 心 尿 酸 症 苔
关节炎 (arthritis) | 血清 骨钙素 测定 胶原蛋白 序列 维生素 白介素 肿瘤 坏死 因子 日去 好坏 泼尼松 拍片 骨折 随访 减药 关系 | 血 清 骨 钙 素 测 B 胶 原 蛋 序 列 羟 维 生 D 介 肿 瘤 坏 死 子 日 泼 尼 松 龙 片 吃 拍 骨 折 样 访 减 药 关 系
Table 1  Examples from the Chinese disease dataset
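The difference between the two text columns in Table 1 can be illustrated with a small sketch. It only shows the character-level view: a string is split into single characters, with whitespace dropped. The helper name `to_char_level` and its `unique` flag are hypothetical. The sample rows suggest that repeated characters are largely collapsed, but the actual preprocessing steps are not described on this page, so deduplication is offered here only as an option.

```python
from typing import List


def to_char_level(text: str, unique: bool = False) -> List[str]:
    """Split text into single characters, dropping whitespace.

    With unique=True, each character is kept only the first time it
    appears. This is an assumption suggested by the sample rows, not a
    documented preprocessing step.
    """
    chars = [ch for ch in text if not ch.isspace()]
    if not unique:
        return chars
    seen = set()
    result = []
    for ch in chars:
        if ch not in seen:
            seen.add(ch)
            result.append(ch)
    return result


word_level = "背部 腹部 水痘"
print(" ".join(to_char_level(word_level)))               # 背 部 腹 部 水 痘
print(" ".join(to_char_level(word_level, unique=True)))  # 背 部 腹 水 痘
```

Producing the word-level column would additionally need a Chinese word segmenter, which is outside the scope of this sketch.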
Fig.3  Multi-feature fusion model MFCF-CNN based on the self-attention mechanism
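As a rough illustration of the two ideas named in the Fig.3 caption, the sketch below concatenates two per-token feature matrices and passes the result through scaled dot-product self-attention. It is a generic textbook formulation, not the MFCF-CNN architecture itself. The function names, the use of concatenation as the fusion step, and the absence of learned projection weights are all assumptions made for brevity.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix) -> Matrix:
    """Scaled dot-product self-attention with Q = K = V = x (no learned weights)."""
    dim = len(x[0])
    output = []
    for query in x:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in x]
        weights = softmax(scores)
        # Each output row is a weighted average of all input rows.
        output.append([sum(w * row[j] for w, row in zip(weights, x)) for j in range(dim)])
    return output


def fuse(features_a: Matrix, features_b: Matrix) -> Matrix:
    """Concatenate two per-token feature matrices, then apply self-attention."""
    fused = [a + b for a, b in zip(features_a, features_b)]
    return self_attention(fused)


# Toy stand-ins for two embedding sources (e.g. Word2Vec-like and GloVe-like vectors).
source_a = [[1.0, 0.0], [0.0, 1.0]]
source_b = [[0.5], [0.5]]
for row in fuse(source_a, source_b):
    print([round(v, 3) for v in row])
```

In a trained model the queries, keys and values would come from learned linear projections; they are omitted here so the attention arithmetic stays visible.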
Fig.4  Dataset split and model training workflow
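External semantic feature vector | Domain | Corpus source
Wiki-WCv | General domain | Wikipedia (2020 edition)
EMR-WCv | Clinical medicine | CCKS2017 electronic medical records
MA-WCv | Biomedicine | Wanfang Med Online (万方医学网), medical literature abstracts
OHC-WCv | Online healthcare | 好问康, 求医问药网
Table 2  External semantic features and corpus sources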
Disease name | Disease description
Arthritis of knee | arthritic knees
Lightheadedness | light headed
Myalgia | Muscle aches & pains
Taste sense altered | taste perversion
Foot pain | pain on the sole of my feet
Myositis | muscle inflammation
Severe pain | severe pain close to my the crotch area
Myalgia | soreness of muscles
Table 3  Examples from the English disease datasets
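Model parameter | CNN | LSTM | GRU
Input sentence vector dimension | 100 | 100 | 100
Number of convolution kernels | 4 | / | /
Neurons | 128 | 128 | 128
Number of input samples | 20 | 20 | 20
Number of iterations | 10 | 20 | 20
Dropout | 0.5
Softmax layer size | Number of normalized disease names
Attention mechanism | Self-attention
Table 4  Experimental parameter settings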
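For readers who want to carry the Table 4 settings into code, one possible arrangement is a small configuration object. The class and field names below are hypothetical. Reading "number of input samples" as a batch size is an assumption, as is treating dropout as shared by all three models.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelConfig:
    """Hyperparameters transcribed from Table 4 (field names are hypothetical)."""
    input_dim: int = 100               # input sentence vector dimension
    neurons: int = 128                 # neurons
    batch_size: int = 20               # "number of input samples", read here as batch size
    iterations: int = 20               # number of iterations
    dropout: float = 0.5               # dropout rate, assumed shared by all models
    num_kernels: Optional[int] = None  # convolution kernels (CNN only)


CNN_CONFIG = ModelConfig(iterations=10, num_kernels=4)
LSTM_CONFIG = ModelConfig()
GRU_CONFIG = ModelConfig()

print(CNN_CONFIG)
```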
Model | Accuracy@1 | Accuracy@5 | Accuracy@10
CNN-WRv-ADR | 18.71% | 47.09% | 54.19%
LSTM-WRv-ADR | 22.58% | 45.81% | 68.39%
GRU-WRv-ADR | 20.65% | 47.10% | 65.81%
CNN-WRv-ASK | 61.19% | 78.10% | 80.12%
LSTM-WRv-ASK | 65.12% | 79.76% | 84.76%
GRU-WRv-ASK | 66.79% | 79.29% | 85.12%
CNN-WRv-CDND | 60.98% | 74.89% | 76.64%
LSTM-WRv-CDND | 59.34% | 72.43% | 75.21%
GRU-WRv-CDND | 58.97% | 71.63% | 74.28%
CNN-CRv-CDND | 70.06% | 83.09% | 84.48%
Table 5  Normalization accuracy for Chinese and English disease names
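The Accuracy@1, Accuracy@5 and Accuracy@10 columns are not defined on this page. Assuming they follow the usual top-k convention (a mention counts as correct when its gold disease name is among the k highest-ranked candidates), the metric could be computed as in this sketch. The function name and the toy rankings are illustrative only; the disease names are borrowed from Table 3.

```python
from typing import Sequence


def accuracy_at_k(ranked_predictions: Sequence[Sequence[str]],
                  gold: Sequence[str], k: int) -> float:
    """Fraction of mentions whose gold name is among the top-k ranked candidates."""
    if len(ranked_predictions) != len(gold):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for ranked, label in zip(ranked_predictions, gold)
               if label in ranked[:k])
    return hits / len(gold)


# Toy rankings for three mentions, best candidate first (made-up data).
ranked = [
    ["Myositis", "Myalgia", "Foot pain"],
    ["Severe pain", "Foot pain", "Myalgia"],
    ["Arthritis of knee", "Myalgia", "Myositis"],
]
gold = ["Myalgia", "Foot pain", "Lightheadedness"]

for k in (1, 2, 3):
    print(f"Accuracy@{k} = {accuracy_at_k(ranked, gold, k):.2%}")
```

Because a larger k can only add hits, Accuracy@k never decreases as k grows, which is consistent with every row of the table.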
Semantic feature | Accuracy@1 | Accuracy@5 | Accuracy@10
Wiki-WCv | 70.30% | 83.40% | 84.99%
EMR-WCv | 69.25% | 82.27% | 83.75%
MA-WCv | 70.36% | 83.41% | 84.92%
OHC-WCv | 70.21% | 83.52% | 84.90%
Table 6  Accuracy of the CNN-WCv model with external semantic feature vectors
Model | Accuracy@1 | Accuracy@5 | Accuracy@10
CNN-WCv | 70.21% | 83.52% | 84.90%
CNN-GCv | 69.62% | 83.21% | 84.51%
MFCF-CNN-AWCv | 70.64% | 83.87% | 85.28%
MFCF-CNN-AGCv | 70.22% | 83.71% | 85.06%
MFCF-CNN-AWGCv | 71.05% | 83.95% | 85.48%
Table 7  Accuracy of Chinese disease name normalization based on multi-feature fusion
Fig.5  Comparative analysis of experimental results
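The improvement quoted in the abstract can be checked against the tables. The short sketch below recomputes Accuracy@10 gains in percentage points from values copied out of Tables 5 and 7. Treating CNN-WRv-CDND as the "basic CNN model" is an inference from the numbers, since 85.48 minus 76.64 gives the reported 8.84. The dictionary and function names are illustrative.

```python
from decimal import Decimal

# Accuracy@10 values (percent) copied from Tables 5 and 7.
accuracy_at_10 = {
    "CNN-WRv-CDND": Decimal("76.64"),
    "CNN-CRv-CDND": Decimal("84.48"),
    "CNN-WCv": Decimal("84.90"),
    "MFCF-CNN-AWGCv": Decimal("85.48"),
}


def gain_over(baseline: str, model: str) -> Decimal:
    """Percentage-point difference in Accuracy@10 between model and baseline."""
    return accuracy_at_10[model] - accuracy_at_10[baseline]


for name in accuracy_at_10:
    print(f"{name}: {gain_over('CNN-WRv-CDND', name):+} points vs CNN-WRv-CDND")
```

`Decimal` is used instead of floats so the two-decimal percentages subtract exactly.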