Data Analysis and Knowledge Discovery, 2020, Vol. 4, Issue (2/3): 39-47     https://doi.org/10.11925/infotech.2096-3467.2019.0549
Recognition Model of Patient Reviews Based on Mixed Sampling and Transfer Learning*
Xiang Fei, Xie Yaotan
School of Medicine and Health Management, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, China
Abstract

[Objective] This study proposes an end-to-end convolutional neural network model based on mixed sampling and transfer learning to handle imbalanced sample data in online patient reviews. [Methods] Mixed sampling and transfer learning are used to address the sample imbalance. An end-to-end deep learning architecture combining Word2Vec with a convolutional neural network then performs distributed representation, feature extraction, and topic classification of the patient review texts. [Results] Compared with traditional machine learning models represented by SVM and with a single convolutional neural network, the proposed model significantly improved precision, recall, and F1 values. [Limitations] The imbalanced samples in this study come only from online patient review texts. [Conclusions] The proposed model effectively improves patient review recognition when the samples are imbalanced.

Key words: Mixed Sampling; Transfer Learning; Imbalanced Data; Convolutional Neural Network; Patient Reviews Recognition
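Received: 2019-05-24      Published: 2020-04-26
CLC Number: TP393
Funding: *This work is one of the research outcomes of the Huazhong University of Science and Technology Independent Innovation Fund project "Construction Model and Service Design of Community Health Information Space" (2014AA034) and of a project supported by the Fundamental Research Funds for the Central Universities.
Corresponding author: Xiang Fei     E-mail: xiangfei@hust.edu.cn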
Cite this article:
Xiang Fei, Xie Yaotan. Recognition Model of Patient Reviews Based on Mixed Sampling and Transfer Learning. Data Analysis and Knowledge Discovery, 2020, 4(2/3): 39-47.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.0549      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I2/3/39
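Fig.1  Multi-topic label recognition framework based on mixed sampling and transfer learning
Fig.2  Mixed sampling process
Fig.3  Patient review recognition model based on an end-to-end convolutional neural network
Fig.4  Skip-Gram word vector model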
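Fig.2 names a mixed sampling process, but the figure itself is not reproduced on this page. The following is only a minimal sketch of the general idea, assuming random undersampling of the majority class combined with random oversampling of the minority class until both reach the mean class size. The function name `mixed_sample` and the meet-in-the-middle target are hypothetical choices for illustration; the paper's actual procedure may differ, for instance by synthesizing new minority samples.

```python
import random

def mixed_sample(positives, negatives, seed=0):
    """Balance two classes by resampling both toward a common size.

    Assumption (not from the paper): the target size is the mean of the
    two original class sizes; the larger class is randomly undersampled
    and the smaller class is randomly oversampled with replacement.
    """
    rng = random.Random(seed)
    target = (len(positives) + len(negatives)) // 2

    def resample(items):
        if len(items) >= target:
            # Undersample: keep a random subset without replacement.
            return rng.sample(items, target)
        # Oversample: keep every original, then draw extras with replacement.
        extras = [rng.choice(items) for _ in range(target - len(items))]
        return list(items) + extras

    return resample(positives), resample(negatives)

# Toy data sized like the most imbalanced topic in Table 1
# (107 positive and 1,893 negative examples).
pos = [f"pos_{i}" for i in range(107)]
neg = [f"neg_{i}" for i in range(1893)]
new_pos, new_neg = mixed_sample(pos, neg)
print(len(new_pos), len(new_neg))  # 1000 1000
```

Resampling both classes avoids discarding most of the majority class (as pure undersampling would) while limiting how often each minority example is repeated (as pure oversampling would).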
Topic          Positive examples   Negative examples   IR
Attitude       1,313               687                 1.91
Competence     515                 1,485               2.88
Measures       841                 1,159               1.38
Effect         596                 1,404               2.36
Environment    357                 1,643               4.60
Cost           107                 1,893               17.69
Table 1  Basic statistics of the experimental datasets
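The IR column in Table 1 is not defined on this page. Its values are consistent with an imbalance ratio computed as the larger class count divided by the smaller one, which the short check below illustrates. The English topic names and the helper `imbalance_ratio` are labels introduced here for illustration, and the formula is an inference from the numbers rather than a definition quoted from the paper.

```python
# (positive, negative) counts per topic, copied from Table 1.
counts = {
    "attitude": (1313, 687),
    "competence": (515, 1485),
    "measures": (841, 1159),
    "effect": (596, 1404),
    "environment": (357, 1643),
    "cost": (107, 1893),
}

def imbalance_ratio(pos, neg):
    """Majority-class size divided by minority-class size (assumed definition)."""
    return max(pos, neg) / min(pos, neg)

ratios = {topic: imbalance_ratio(p, n) for topic, (p, n) in counts.items()}
for topic, ir in ratios.items():
    print(f"{topic}: {ir:.2f}")
```

Under this reading, an IR of 1 means perfectly balanced classes, and the cost topic (about 17.7 negatives per positive) is by far the hardest case.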
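Topic 1        Topic 2        Co-occurrence count
Environment    Attitude       190
Environment    Competence     54
Environment    Measures       116
Environment    Effect         72
Cost           Attitude       51
Cost           Competence     25
Cost           Measures       62
Cost           Effect         31
Table 2  Co-occurrence of topic labels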
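Co-occurrence counts such as those in Table 2 can be obtained by counting, for every pair of topics, how many reviews carry both labels. The sketch below assumes each review is annotated with a set of topic labels; the toy reviews and the function name `cooccurrence` are made up for illustration and are not the paper's data or code.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(label_sets):
    """Count how many reviews carry each unordered pair of topic labels."""
    counts = Counter()
    for labels in label_sets:
        # Sorting gives each unordered pair a single canonical key.
        for a, b in combinations(sorted(labels), 2):
            counts[(a, b)] += 1
    return counts

# Hypothetical multi-label annotations, one set of topics per review.
reviews = [
    {"environment", "attitude"},
    {"environment", "attitude", "measures"},
    {"cost", "measures"},
    {"attitude"},
]
pairs = cooccurrence(reviews)
print(pairs[("attitude", "environment")])  # 2
```

A high count between a minority topic and a majority topic suggests the two are related in the data, which is one plausible basis for choosing a source topic for transfer; whether the paper uses the counts this way is not stated on this page.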
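Parameter    Value    Meaning
size         200      Word vector dimensionality
window       5        Window size: maximum distance within a sentence between the current word and the predicted word
sg           1        Word vector training model: Skip-Gram
min_count    5        Word frequency threshold
Table 3  Word vector training parameters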
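To make the roles of `window` and `min_count` in Table 3 concrete, the sketch below generates the (center word, context word) training pairs that a Skip-Gram model learns from. It is a simplified illustration in plain Python, not the training code used in the paper: it always uses the full window, whereas common Word2Vec implementations randomly shrink the window per position, and it omits the neural network itself. The function name `skipgram_pairs` and the toy sentences are hypothetical.

```python
from collections import Counter

def skipgram_pairs(sentences, window=5, min_count=5):
    """Build (center, context) pairs for Skip-Gram training.

    Words occurring fewer than min_count times in the corpus are dropped
    before windowing; each remaining word is paired with every word at
    most `window` positions away in the same sentence.
    """
    freq = Counter(tok for sent in sentences for tok in sent)
    pairs = []
    for sent in sentences:
        kept = [tok for tok in sent if freq[tok] >= min_count]
        for i, center in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, kept[j]))
    return pairs

# Toy corpus with small settings so the effect is visible.
corpus = [
    ["doctor", "very", "patient"],
    ["doctor", "patient", "kind"],
]
toy_pairs = skipgram_pairs(corpus, window=1, min_count=2)
print(toy_pairs)
```

With `min_count=2`, the words "very" and "kind" are filtered out, so only "doctor" and "patient" form pairs. With the Table 3 values (`window=5`, `min_count=5`), the same logic applies on a larger corpus.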
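Parameter        Value           Meaning
filter size      [1,2,3]         Convolution kernel sizes
filter number    128             Number of convolution kernels
dropout rate     0.50-0.75       Dropout rate
l2_alpha         10              L2 regularization coefficient
learning rate    1e-4 to 1e-3    Learning rate for stochastic gradient descent
Table 4  Convolutional neural network parameters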
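The filter settings in Table 4 determine the size of the feature vector that the convolutional layer hands to the classifier. The sketch below is a plain-Python illustration of convolution over word windows followed by max-over-time pooling, assuming the common sentence-CNN layout in which each filter spans k consecutive word vectors. Weights are random and untrained, and bias terms, dropout, L2 regularization, and the output layer are omitted, so this shows only the shape of the computation and not the paper's model. The function name `conv_max_pool` is hypothetical.

```python
import random

def conv_max_pool(embeddings, filter_sizes=(1, 2, 3), num_filters=128, seed=0):
    """Return one max-pooled ReLU activation per filter.

    embeddings: list of word vectors (one list of floats per word).
    Each filter of size k covers k consecutive word vectors.
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    features = []
    for k in filter_sizes:
        for _ in range(num_filters):
            # Random stand-in weights: k * dim values for one filter.
            weights = [rng.uniform(-0.1, 0.1) for _ in range(k * dim)]
            best = 0.0  # starting at zero applies the ReLU floor
            for start in range(len(embeddings) - k + 1):
                window = [x for vec in embeddings[start:start + k] for x in vec]
                activation = sum(w * x for w, x in zip(weights, window))
                best = max(best, activation)  # max-over-time pooling
            features.append(best)
    return features

# A hypothetical 6-word review with 200-dimensional word vectors (Table 3).
data_rng = random.Random(1)
sentence = [[data_rng.uniform(-1, 1) for _ in range(200)] for _ in range(6)]
features = conv_max_pool(sentence)
print(len(features))  # 384
```

With three filter sizes and 128 filters per size, every review maps to a fixed 384-dimensional feature vector regardless of its length, which is what allows a single classifier layer to follow.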
Algorithm    Attitude   Competence   Measures   Effect    Environment   Cost
SVM          0.9377     0.8083       0.6363     0.7424    0.6363        0.3792
CNN          0.9628     0.9580       0.8488     0.8090    0.8186        0.8026
CNN+MS       0.9653     0.9333       0.8427     0.8501    0.7621        0.7145
CNN+TL       -          -            -          -         0.8369        0.8375
CNN+MS+TL    -          -            -          -         0.7483        0.7554
Table 5  Precision of the classification models on each topic dataset
Algorithm    Attitude   Competence   Measures   Effect    Environment   Cost
SVM          0.908      0.7648       0.6243     0.7097    0.6243        0.2336
CNN          0.8956     0.7998       0.8193     0.7062    0.6617        0.5337
CNN+MS       0.8957     0.8288       0.8418     0.7535    0.7339        0.6236
CNN+TL       -          -            -          -         0.6948        0.5518
CNN+MS+TL    -          -            -          -         0.8038        0.6818
Table 6  Recall of the classification models on each topic dataset
Algorithm    Attitude   Competence   Measures   Effect    Environment   Cost
SVM          0.9221     0.8190       0.6747     0.7244    0.6195        0.2850
CNN          0.9277     0.8678       0.8322     0.7527    0.7235        0.6319
CNN+MS       0.9289     0.8764       0.8406     0.7970    0.7433        0.6541
CNN+TL       -          -            -          -         0.7560        0.6556
CNN+MS+TL    -          -            -          -         0.7724        0.7124
Table 7  F1 values of the classification models on each topic dataset
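The three metric tables can be cross-checked against each other. Assuming the first metric in Table 5 is precision (the Chinese term is often also rendered as "accuracy"), F1 is the harmonic mean of precision and recall. The sketch below recomputes F1 for the attitude topic from the Table 5 and Table 6 values and compares it with Table 7. Small gaps are expected if the reported figures are averages over several runs, which is an assumption and not something stated on this page.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Attitude topic: (Table 5 value, Table 6 value, Table 7 value) per model.
attitude = {
    "SVM": (0.9377, 0.908, 0.9221),
    "CNN": (0.9628, 0.8956, 0.9277),
    "CNN+MS": (0.9653, 0.8957, 0.9289),
}

# Absolute gap between the recomputed and the reported F1 per model.
gaps = {}
for model, (p, r, reported) in attitude.items():
    computed = f1_score(p, r)
    gaps[model] = abs(computed - reported)
    print(f"{model}: computed {computed:.4f}, reported {reported:.4f}")
```

For these three rows the recomputed values agree with Table 7 to within 0.002, which supports reading the Table 5 metric as precision.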