Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (2/3): 233-241     https://doi.org/10.11925/infotech.2096-3467.2021.0954
Automatic Classification with Unbalanced Data for Electronic Medical Records*
Zhang Yunqiu1, Li Bocheng1, Chen Yan2
1College of Public Health, Jilin University, Changchun 130021, China
2Shenzhen Health Development Research and Data Management Center, Shenzhen 518028, China
Abstract

[Objective] This paper proposes an automatic classification method for electronic medical records with unbalanced data, aiming to further improve classification performance on clinical electronic medical records. [Methods] First, we used MC-BERT to enhance the semantic representation of electronic medical records. Then, we designed a deep neural network framework to improve the model's semantic extraction capability. Finally, we designed a new loss function that addresses two kinds of imbalance, in the number of samples per category and in the classification difficulty of samples, by combining the class-count ratio, the gradient harmonizing mechanism (GHM), and category similarity. [Results] We evaluated the method in empirical and comparative experiments on a real electronic medical record dataset. Its accuracy, macro-average F1, and micro-average F1 reached 81.37%, 65.89%, and 81.47%, respectively, outperforming previously proposed classification methods. [Limitations] The empirical study covers medical records from a single clinical department only. [Conclusions] The proposed method for unbalanced data can effectively improve classification performance on electronic medical records.

Keywords: Unbalanced Data; Deep Learning; Electronic Medical Records; Cost-Sensitive Learning
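The abstract names the class-count ratio as one ingredient of the loss function. The sketch below illustrates only that ingredient, under assumptions: it uses inverse-frequency weights normalised to a mean of 1, which is one common choice and not necessarily the weighting used in the paper. The class sizes, probabilities, and function names are hypothetical.

```python
import numpy as np

def class_ratio_weights(counts):
    """Inverse-frequency class weights, normalised so they average to 1."""
    counts = np.asarray(counts, dtype=float)
    inv = counts.sum() / counts          # rarer class -> larger value
    return inv / inv.mean()

def weighted_cross_entropy(probs, labels, weights, eps=1e-12):
    """Mean cross-entropy with each sample scaled by the weight of its true class."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    weights = np.asarray(weights, dtype=float)
    p_true = probs[np.arange(len(labels)), labels]   # predicted probability of the true class
    return float(np.mean(weights[labels] * -np.log(p_true + eps)))

# Hypothetical data: three classes with a strong size imbalance.
counts = [900, 90, 10]
w = class_ratio_weights(counts)
probs = np.array([[0.8, 0.1, 0.1],
                  [0.3, 0.6, 0.1],
                  [0.5, 0.3, 0.2]])
labels = np.array([0, 1, 2])
print(w, weighted_cross_entropy(probs, labels, w))
```

With these weights, an error on the 10-sample class costs far more than an error on the 900-sample class, which is the basic effect a cost-sensitive loss aims for.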
Received: 2021-08-31      Published: 2022-04-14
Chinese Library Classification (ZTFLH):  TP391
Funding: *This work is one of the research outcomes of the Humanities and Social Sciences Planning Project of the Ministry of Education (18YJA870017), a commissioned project of the Shenzhen Medical Information Center (2020(261)), and the Graduate Innovation Fund of Jilin University (101832020CX279).
Corresponding author: Zhang Yunqiu, ORCID: 0000-0002-9790-9581     E-mail: yunqiu@jlu.edu.cn
Cite this article:
Zhang Yunqiu, Li Bocheng, Chen Yan. Automatic Classification with Unbalanced Data for Electronic Medical Records[J]. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 233-241.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0954      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I2/3/233
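Fig.1  Framework of the classification model for unbalanced data
Fig.2  Illustration of the disease knowledge network
Fig.3  Distribution of diseases in the dataset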
Item                      Configuration
GPU                       GTX 1050TI (1 card)
CPU                       E5-2678V3
Development environment   Python 3.7.3, TensorFlow 1.15.2
Epoch                     20
LSTM learning rate        0.001
Dropout                   0.5
Table 1  Experimental environment settings
                    Predicted Positive   Predicted Negative
Actual Positive     TP                   FN
Actual Negative     FP                   TN
Table 2  Confusion matrix
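The result tables on this page report ACC, Macro F1, and Micro F1. The sketch below shows how these three metrics follow from confusion-matrix counts when the binary layout of Table 2 is applied to each class in a one-vs-rest fashion. It assumes single-label multi-class predictions, which this page does not confirm for the paper's experiments; the matrix values and the function name are hypothetical.

```python
import numpy as np

def classification_metrics(cm):
    """cm[i, j] = number of samples whose actual class is i and predicted class is j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp     # predicted as the class, actually another class
    fn = cm.sum(axis=1) - tp     # actually the class, predicted as another class

    def f1(tp, fp, fn):
        denom = 2 * tp + fp + fn
        # Counts are integers, so denom is either 0 (then tp is 0 too) or at least 1.
        return 2 * tp / np.maximum(denom, 1.0)

    accuracy = tp.sum() / cm.sum()
    macro_f1 = f1(tp, fp, fn).mean()                  # average of per-class F1
    micro_f1 = f1(tp.sum(), fp.sum(), fn.sum())       # F1 over pooled counts
    return float(accuracy), float(macro_f1), float(micro_f1)

# Hypothetical 3-class confusion matrix with one small class.
cm = [[50, 5, 0],
      [10, 20, 5],
      [0, 5, 5]]
acc, macro, micro = classification_metrics(cm)
print(acc, macro, micro)
```

Macro F1 weights every class equally, so a poorly recognised rare class pulls it down much more than it pulls down accuracy. That is why it is the more informative number on unbalanced data.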
Method               ACC      Macro F1   Micro F1
BERT-base-Chinese    78.86%   63.48%     76.33%
ERNIE                80.23%   64.67%     79.67%
Proposed method      81.37%   65.89%     81.47%
Table 3  Comparison of pre-trained models
Method               ACC      Macro F1   Micro F1
Text-CNN             79.21%   64.33%     80.03%
BiLSTM               71.29%   58.21%     70.56%
BiGRU                72.43%   60.07%     72.23%
Text-CNN-BiGRU       80.44%   64.32%     80.21%
Proposed method      81.37%   65.89%     81.47%
Table 4  Comparison of neural network models
Method                   ACC      Macro F1   Micro F1
Weighted cross-entropy   74.83%   60.22%     74.97%
Focal Loss               77.35%   63.88%     77.53%
GHM                      80.57%   65.79%     79.92%
Proposed method          81.37%   65.89%     81.47%
Table 5  Comparison of loss functions
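Two of the baselines in Table 5, Focal Loss and GHM, reweight samples by how hard they are to classify. The sketch below gives simplified textbook forms of both for illustration; it is not the paper's combined loss. Several details are assumptions: `1 - p_true` is used as a proxy for the gradient norm in the softmax case, the weights are normalised by the number of non-empty bins, and the probabilities and function names are hypothetical.

```python
import numpy as np

def focal_loss(p_true, gamma=2.0, eps=1e-12):
    """Per-sample focal loss: -(1 - p_t)^gamma * log(p_t)."""
    p_true = np.asarray(p_true, dtype=float)
    return -((1.0 - p_true) ** gamma) * np.log(p_true + eps)

def ghm_weights(p_true, bins=10):
    """Gradient-density weights in the spirit of GHM-C (simplified)."""
    p_true = np.asarray(p_true, dtype=float)
    g = 1.0 - p_true                                      # gradient-norm proxy in [0, 1]
    idx = np.minimum((g * bins).astype(int), bins - 1)    # bin index of each sample
    counts = np.bincount(idx, minlength=bins)             # number of samples per bin
    nonempty = np.count_nonzero(counts)
    # Samples in crowded bins get small weights; the weights sum to len(g).
    return len(g) / (counts[idx] * nonempty)

def ghm_cross_entropy(p_true, bins=10, eps=1e-12):
    """Mean cross-entropy reweighted by gradient density."""
    p_true = np.asarray(p_true, dtype=float)
    w = ghm_weights(p_true, bins)
    return float(np.sum(w * -np.log(p_true + eps)) / len(p_true))

# Hypothetical probabilities of the true class: four easy samples, two harder ones.
p_true = np.array([0.95, 0.97, 0.99, 0.92, 0.62, 0.15])
print(focal_loss(p_true))
print(ghm_weights(p_true), ghm_cross_entropy(p_true))
```

Focal loss shrinks the contribution of every confident prediction by a fixed rule. The GHM-style weights depend on how many samples share a similar gradient norm, so a large cluster of very hard samples, which may be outliers, is down-weighted as well.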