Data Analysis and Knowledge Discovery, 2020, Vol. 4, Issue 2/3: 39-47    DOI: 10.11925/infotech.2096-3467.2019.0549
Recognition Model of Patient Reviews Based on Mixed Sampling and Transfer Learning
Xiang Fei, Xie Yaotan
School of Medicine and Health Management, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, China
Abstract

[Objective] This study proposes an end-to-end convolutional neural network model based on mixed sampling and transfer learning to handle imbalanced samples in online patient reviews. [Methods] Mixed sampling and transfer learning are used to address the sample imbalance, and an end-to-end deep learning architecture combining Word2Vec with a convolutional neural network performs distributed representation, feature extraction, and topic classification of the review text. [Results] Compared with traditional machine learning models represented by SVM and with a standalone convolutional neural network, the proposed model clearly improves accuracy, recall, and F1. [Limitations] The imbalanced samples in this study come only from online patient reviews. [Conclusions] The proposed model effectively improves patient review recognition on imbalanced samples.

Key words: Mixed Sampling; Transfer Learning; Imbalanced Data; Convolutional Neural Network; Patient Review Recognition
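Received: 2019-05-24
CLC number: TP393
Funding: *This work is one of the research outcomes of the Huazhong University of Science and Technology Independent Innovation Fund project "Construction Model and Service Design of Community Health Information Space" (2014AA034) and of a project supported by the Fundamental Research Funds for the Central Universities.
Corresponding author: Xiang Fei     E-mail: xiangfei@hust.edu.cn
Cite this article:
Xiang Fei, Xie Yaotan. Recognition Model of Patient Reviews Based on Mixed Sampling and Transfer Learning[J]. Data Analysis and Knowledge Discovery, 2020, 4(2/3): 39-47. DOI: 10.11925/infotech.2096-3467.2019.0549.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.0549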
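Figure 1  Multi-topic label recognition framework based on mixed sampling and transfer learning
Figure 2  Mixed sampling process
Figure 3  Patient review recognition model based on an end-to-end convolutional neural network
Figure 4  Skip-Gram word vector model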
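Figure 2 is not reproduced on this page, so the concrete sampling steps are not recoverable here. As a rough illustration of what mixed sampling generally involves, the sketch below undersamples the majority class and oversamples the minority class until both reach a common size. Plain random resampling, the function name `mixed_sample`, and the `target_size` parameter are assumptions made for illustration; they are not taken from the paper, which may use more refined techniques.

```python
import random


def mixed_sample(majority, minority, target_size, seed=0):
    """Randomly undersample the majority class and oversample the minority
    class (with replacement) toward a common size.

    Assumption: plain random resampling stands in for whatever concrete
    steps the paper's Figure 2 depicts. target_size is expected to lie
    between the two class sizes.
    """
    if not majority or not minority:
        raise ValueError("both classes must be non-empty")
    rng = random.Random(seed)
    # Undersample: keep a random subset of the majority class.
    kept_majority = rng.sample(majority, min(target_size, len(majority)))
    # Oversample: keep every minority item, then draw random duplicates.
    extra = max(0, target_size - len(minority))
    resampled_minority = list(minority) + [rng.choice(minority) for _ in range(extra)]
    return kept_majority, resampled_minority


# Hypothetical items sized like the most imbalanced topic in Table 1
# (107 positive vs. 1893 negative examples).
minority = [f"pos_{i}" for i in range(107)]
majority = [f"neg_{i}" for i in range(1893)]
maj, mino = mixed_sample(majority, minority, target_size=1000)
print(len(maj), len(mino))
```

Both classes end up with 1000 items: the majority side loses examples, while the minority side keeps every original and gains duplicates.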
Table 1  Basic information on the experimental datasets
Topic        Positive examples  Negative examples  IR
Attitude     1313               687                1.91
Competence   515                1485               2.88
Measures     841                1159               1.38
Effect       596                1404               2.36
Environment  357                1643               4.60
Cost         107                1893               17.69
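This page does not define IR, but the column in Table 1 is consistent with the imbalance ratio computed as the larger class count divided by the smaller one. The short check below recomputes it from the table's counts; the function name `imbalance_ratio` is introduced here for illustration, and the definition is inferred from the numbers rather than stated by the source.

```python
def imbalance_ratio(n_positive, n_negative):
    """Return the majority-class size divided by the minority-class size."""
    larger, smaller = max(n_positive, n_negative), min(n_positive, n_negative)
    if smaller == 0:
        raise ValueError("minority class is empty")
    return larger / smaller


# (positive, negative) counts copied from Table 1, in row order.
counts = [(1313, 687), (515, 1485), (841, 1159), (596, 1404), (357, 1643), (107, 1893)]
ratios = [round(imbalance_ratio(p, n), 2) for p, n in counts]
print(ratios)  # [1.91, 2.88, 1.38, 2.36, 4.6, 17.69]
```

Every topic dataset contains 2000 reviews, so the ratio depends only on how the positives and negatives split.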
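Table 2  Co-occurrence of topic labels
Topic 1      Topic 2     Co-occurrence count
Environment  Attitude    190
Environment  Competence  54
Environment  Measures    116
Environment  Effect      72
Cost         Attitude    51
Cost         Competence  25
Cost         Measures    62
Cost         Effect      31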
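Counts like those in Table 2 can be tallied directly from multi-label annotations. The sketch below shows one way to do so; the label sets are hypothetical and the function name `cooccurrence` is introduced for illustration. This page does not say how the authors computed or used the counts.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(label_sets):
    """Count how often each unordered pair of topic labels appears together."""
    counts = Counter()
    for labels in label_sets:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(set(labels)), 2):
            counts[pair] += 1
    return counts


# Hypothetical annotations: each review carries a set of topic labels.
reviews = [
    {"environment", "attitude"},
    {"environment", "attitude", "cost"},
    {"cost", "measures"},
    {"attitude"},
]
pairs = cooccurrence(reviews)
print(pairs[("attitude", "environment")])  # 2
```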
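Table 3  Word vector training parameters
Parameter  Value  Meaning
size       200    Word vector dimensionality
window     5      Window size: maximum distance between the current word and the predicted word within a sentence
sg         1      Word vector training model: Skip-Gram
min_count  5      Word frequency threshold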
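The parameter names in Table 3 match those of gensim's `Word2Vec` class, although this page does not name the library used. Assuming gensim, the sketch below turns the table into keyword arguments. The helper `word2vec_kwargs` is hypothetical; the one library fact it relies on is that gensim 4.0 renamed `size` to `vector_size`.

```python
TABLE3_PARAMS = {"size": 200, "window": 5, "sg": 1, "min_count": 5}


def word2vec_kwargs(params, gensim_major_version=4):
    """Map the Table 3 names to keyword arguments for gensim's Word2Vec."""
    kwargs = dict(params)
    if gensim_major_version >= 4:
        # gensim 4.0 renamed `size` to `vector_size`.
        kwargs["vector_size"] = kwargs.pop("size")
    return kwargs


kwargs = word2vec_kwargs(TABLE3_PARAMS)
print(kwargs)

# With gensim installed and `sentences` given as a list of token lists
# (for example, word-segmented reviews), training would look like:
#   from gensim.models import Word2Vec
#   model = Word2Vec(sentences=sentences, **kwargs)
```

Setting `sg=1` selects Skip-Gram rather than CBOW, and `min_count=5` drops words that occur fewer than five times in the corpus.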
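Table 4  Convolutional neural network parameters
Parameter      Value         Meaning
filter size    [1,2,3]       Convolution kernel sizes
filter number  128           Number of convolution kernels
dropout rate   0.50-0.75     Dropout rate
l2_alpha       10            L2 regularization coefficient
learning rate  1e-4 to 1e-3  Learning rate for stochastic gradient descent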
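To make the filter settings in Table 4 concrete, the sketch below runs the feature-extraction stage of a common text-CNN layout in pure Python: filters of width 1, 2, and 3 slide over the word vectors, each followed by max-over-time pooling. The ReLU activation, the absence of bias terms, the random weights, and the 6-word input are assumptions made for illustration; the paper's actual architecture is shown in its Figure 3, which is not reproduced here.

```python
import random

FILTER_SIZES = [1, 2, 3]   # Table 4: filter size
NUM_FILTERS = 128          # Table 4: filter number
EMBED_DIM = 200            # Table 3: size


def conv_max_pool(sentence, filters, width):
    """Slide each filter over `width` consecutive word vectors, apply ReLU,
    and keep the maximum activation per filter (max-over-time pooling).
    The sentence must contain at least `width` words."""
    windows = [
        [x for vec in sentence[i:i + width] for x in vec]  # flatten the window
        for i in range(len(sentence) - width + 1)
    ]
    pooled = []
    for weights in filters:
        acts = [max(0.0, sum(w * x for w, x in zip(weights, win))) for win in windows]
        pooled.append(max(acts))
    return pooled


def extract_features(sentence, rng):
    """Concatenate the pooled features over all filter widths."""
    features = []
    for width in FILTER_SIZES:
        # Randomly initialised filters stand in for trained weights.
        filters = [[rng.uniform(-0.1, 0.1) for _ in range(width * EMBED_DIM)]
                   for _ in range(NUM_FILTERS)]
        features.extend(conv_max_pool(sentence, filters, width))
    return features


rng = random.Random(0)
# Hypothetical 6-word review, each word a 200-dimensional vector.
sentence = [[rng.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(6)]
features = extract_features(sentence, rng)
print(len(features))  # 384
```

With three filter widths and 128 filters each, every review is reduced to a fixed 384-dimensional vector regardless of its length, which is what a downstream dropout and classification layer would consume.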
Table 5  Accuracy of the classification models on each topic dataset
Algorithm  Attitude  Competence  Measures  Effect  Environment  Cost
SVM        0.9377    0.8083      0.6363    0.7424  0.6363       0.3792
CNN        0.9628    0.9580      0.8488    0.8090  0.8186       0.8026
CNN+MS     0.9653    0.9333      0.8427    0.8501  0.7621       0.7145
CNN+TL     -         -           -         -       0.8369       0.8375
CNN+MS+TL  -         -           -         -       0.7483       0.7554
Table 6  Recall of the classification models on each topic dataset
Algorithm  Attitude  Competence  Measures  Effect  Environment  Cost
SVM        0.9080    0.7648      0.6243    0.7097  0.6243       0.2336
CNN        0.8956    0.7998      0.8193    0.7062  0.6617       0.5337
CNN+MS     0.8957    0.8288      0.8418    0.7535  0.7339       0.6236
CNN+TL     -         -           -         -       0.6948       0.5518
CNN+MS+TL  -         -           -         -       0.8038       0.6818
Table 7  F1 of the classification models on each topic dataset
Algorithm  Attitude  Competence  Measures  Effect  Environment  Cost
SVM        0.9221    0.8190      0.6747    0.7244  0.6195       0.2850
CNN        0.9277    0.8678      0.8322    0.7527  0.7235       0.6319
CNN+MS     0.9289    0.8764      0.8406    0.7970  0.7433       0.6541
CNN+TL     -         -           -         -       0.7560       0.6556
CNN+MS+TL  -         -           -         -       0.7724       0.7124
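For readers less familiar with the metrics reported for recall and F1, the sketch below computes them from confusion counts using the standard definitions, with F1 as the harmonic mean of precision and recall. The counts are hypothetical and unrelated to the paper's experiments, and the function name `precision_recall_f1` is introduced here for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical confusion counts for one topic.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.8 0.6667 0.7273
```

On imbalanced data, recall and F1 on the minority class are more informative than overall accuracy, because a classifier that always predicts the majority class can still score a high accuracy.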