Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (2/3): 39-47    DOI: 10.11925/infotech.2096-3467.2019.0549
Recognition Model of Patient Reviews Based on Mixed Sampling and Transfer Learning
Xiang Fei, Xie Yaotan
School of Medicine and Health Management, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, China
[Objective] This study proposes a convolutional neural network model for classifying imbalanced online patient review data. [Methods] We built the model by combining mixed sampling with transfer learning. An end-to-end deep learning architecture based on Word2Vec and a convolutional neural network performed the distributed representation, feature extraction, and topic classification of online patient reviews. [Results] Compared with a traditional machine learning algorithm (SVM) and a standalone convolutional neural network, the proposed model significantly improved accuracy, recall, and F1 values. [Limitations] The imbalanced data in this study came only from online patient reviews. [Conclusions] The proposed model can effectively improve recognition results on imbalanced data.

Key words: Mixed Sampling; Transfer Learning; Imbalanced Data; Convolutional Neural Network; Patient Reviews Recognition
Received: 24 May 2019      Published: 26 April 2020
Xiang Fei, Xie Yaotan. Recognition Model of Patient Reviews Based on Mixed Sampling and Transfer Learning. Data Analysis and Knowledge Discovery, 2020, 4(2/3): 39-47.

Recognition Framework of Multi-label Data Based on Mixed Sampling and Transfer Learning
Mixed Sampling Process
Patient Reviews Recognition Model Based on End-to-End CNN
Skip-Gram Model
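The figure captions above name a mixed sampling process, but this page does not describe its steps. As a rough sketch only, one common reading of mixed sampling is to undersample the majority class and oversample the minority class toward a shared target size. The function name `mixed_sample`, the `target_size` parameter, and the use of plain random resampling are all assumptions for illustration, not the authors' procedure.

```python
import random


def mixed_sample(majority, minority, target_size, seed=0):
    """Resample two classes so that each has `target_size` examples.

    Hypothetical helper: the majority class is randomly undersampled
    (without replacement) and the minority class is randomly oversampled
    (with replacement). The source does not specify its sampling rules.
    """
    if not len(minority) <= target_size <= len(majority):
        raise ValueError("target_size must lie between the two class sizes")
    if not minority:
        raise ValueError("minority class is empty")
    rng = random.Random(seed)
    # Undersample: keep a random subset of the majority class.
    new_majority = rng.sample(list(majority), target_size)
    # Oversample: keep every minority example, then add random duplicates.
    new_minority = list(minority)
    while len(new_minority) < target_size:
        new_minority.append(rng.choice(minority))
    return new_majority, new_minority


# Placeholder examples sized like a 1,893-vs-107 class split.
negatives = [f"neg_{i}" for i in range(1893)]
positives = [f"pos_{i}" for i in range(107)]
maj, mino = mixed_sample(negatives, positives, target_size=1000)
print(len(maj), len(mino))  # 1000 1000
```

Meeting in the middle, rather than only duplicating minority examples or only discarding majority examples, limits both the number of repeated minority reviews and the amount of majority data thrown away.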
Topic        Positive examples  Negative examples  IR
Attitude     1,313              687                1.91
Competence   515                1,485              2.88
Measures     841                1,159              1.38
Effect       596                1,404              2.36
Environment  357                1,643              4.60
Cost         107                1,893              17.69
Description of Experimental Data Set
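The IR (imbalance ratio) column is not defined on this page. The tabulated values are consistent with dividing the larger class count by the smaller one, so the sketch below assumes that definition. The function name `imbalance_ratio` and the English topic keys are illustrative choices; the counts are copied from the table.

```python
def imbalance_ratio(n_positive, n_negative):
    """Return the majority-class size divided by the minority-class size.

    This definition is inferred from the table values, not stated in the source.
    """
    majority = max(n_positive, n_negative)
    minority = min(n_positive, n_negative)
    if minority == 0:
        raise ValueError("minority class is empty")
    return majority / minority


# (positive, negative) counts per topic, copied from the data set table.
TOPIC_COUNTS = {
    "Attitude": (1313, 687),     # 态度
    "Competence": (515, 1485),   # 能力
    "Measures": (841, 1159),     # 措施
    "Effect": (596, 1404),       # 效果
    "Environment": (357, 1643),  # 环境
    "Cost": (107, 1893),         # 费用
}

for topic, (pos, neg) in TOPIC_COUNTS.items():
    print(f"{topic}: IR = {imbalance_ratio(pos, neg):.2f}")
```

Under this reading, Attitude is the only topic in which the positive class is the majority; for every other topic the positive examples are the minority.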
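Topic 1      Topic 2     Co-occurrence count
Environment  Attitude    190
Environment  Competence  54
Environment  Measures    116
Environment  Effect      72
Cost         Attitude    51
Cost         Competence  25
Cost         Measures    62
Cost         Effect      31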
Co-occurrence of Topic Labels
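Co-occurrence counts of this kind can be obtained by counting, for each pair of topics, the reviews that carry both labels. The following is a minimal sketch under that assumption; the function name `label_cooccurrence` and the four toy reviews are hypothetical and are not drawn from the study's data.

```python
from collections import Counter
from itertools import combinations


def label_cooccurrence(label_sets):
    """Count how many reviews carry each unordered pair of topic labels."""
    counts = Counter()
    for labels in label_sets:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(set(labels)), 2):
            counts[pair] += 1
    return counts


# Toy multi-label annotations (hypothetical, for illustration only).
reviews = [
    {"Environment", "Attitude"},
    {"Environment", "Attitude", "Cost"},
    {"Cost", "Measures"},
    {"Effect"},
]
counts = label_cooccurrence(reviews)
print(counts[("Attitude", "Environment")])  # 2
print(counts[("Cost", "Measures")])         # 1
```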
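Parameter  Value  Meaning
size       200    Word vector dimensionality
window     5      Window size: maximum distance between the current word and the predicted word within a sentence
sg         1      Word vector training model: Skip-Gram
min_count  5      Word frequency threshold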
Parameters of Word2Vec Training
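To make the roles of `window` and `min_count` concrete, the sketch below generates the (center word, context word) pairs that a Skip-Gram model would be trained on. It is a simplification: it uses a fixed window and omits the embedding training itself, and the function name `skipgram_pairs` is hypothetical. The parameter names in the table resemble those of gensim's `Word2Vec` class (where `size` was renamed `vector_size` in gensim 4.0), but the source does not name the library, so that link is an assumption.

```python
from collections import Counter


def skipgram_pairs(sentences, window=5, min_count=5):
    """Return (center, context) training pairs for a skip-gram model.

    sentences: list of tokenized sentences (lists of strings).
    window:    maximum distance between center and context word.
    min_count: tokens seen fewer times than this are dropped first.
    """
    freq = Counter(tok for sent in sentences for tok in sent)
    pairs = []
    for sent in sentences:
        # Remove rare tokens before windowing.
        kept = [tok for tok in sent if freq[tok] >= min_count]
        for i, center in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, kept[j]))
    return pairs


# Toy example with a small window so the output is easy to read.
toy = [["doctor", "very", "patient"]]
print(skipgram_pairs(toy, window=1, min_count=1))
# [('doctor', 'very'), ('very', 'doctor'), ('very', 'patient'), ('patient', 'very')]
```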
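Parameter      Value         Meaning
filter size    [1,2,3]       Convolution kernel sizes
filter number  128           Number of convolution kernels
dropout rate   0.50-0.75     Dropout ratio
l2_alpha       10            L2 regularization coefficient
learning rate  1e-4 to 1e-3  Learning rate for stochastic gradient descent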
Parameters of CNN
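The filter sizes and filter number in the table determine the size of the feature vector that the convolutional layer produces. The sketch below assumes a common text-CNN layout (parallel convolutions over the word embeddings, ReLU, then max-over-time pooling, one output per filter); this page does not confirm that exact layout. The function names are hypothetical, and the random weights stand in for learned ones. Dropout, L2 regularization, and training are omitted.

```python
import random


def conv_max_pool(embedded, filt):
    """Slide one filter over a sentence, apply ReLU, and max-pool over positions.

    embedded: list of n word vectors, each of length d
    filt:     list of k weight vectors, each of length d (k is the filter size)
    """
    n, k = len(embedded), len(filt)
    if n < k:
        raise ValueError("sentence is shorter than the filter size")
    best = 0.0  # a ReLU output is never below zero
    for start in range(n - k + 1):
        total = 0.0
        for offset in range(k):
            word, weights = embedded[start + offset], filt[offset]
            total += sum(x * w for x, w in zip(word, weights))
        best = max(best, total)  # ReLU followed by max-over-time pooling
    return best


def extract_features(embedded, filter_sizes=(1, 2, 3), filter_number=128, seed=0):
    """Concatenate pooled outputs of `filter_number` filters per filter size."""
    rng = random.Random(seed)
    dim = len(embedded[0])
    features = []
    for k in filter_sizes:
        for _ in range(filter_number):
            # Random placeholder weights; a trained model would learn these.
            filt = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(k)]
            features.append(conv_max_pool(embedded, filt))
    return features


# A 10-word sentence with 200-dimensional placeholder embeddings.
rng = random.Random(1)
sentence = [[rng.uniform(-1, 1) for _ in range(200)] for _ in range(10)]
print(len(extract_features(sentence)))  # 384 = 3 filter sizes x 128 filters
```

With three filter sizes and 128 filters each, every review is reduced to a 384-dimensional vector regardless of its length, which a classifier layer can then map to a topic label.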
Algorithm  Attitude  Competence  Measures  Effect  Environment  Cost
SVM        0.9377    0.8083      0.6363    0.7424  0.6363       0.3792
CNN        0.9628    0.9580      0.8488    0.8090  0.8186       0.8026
CNN+MS     0.9653    0.9333      0.8427    0.8501  0.7621       0.7145
CNN+TL     -         -           -         -       0.8369       0.8375
CNN+MS+TL  -         -           -         -       0.7483       0.7554
Accuracy of Classification Models for Different Topic Datasets
Algorithm  Attitude  Competence  Measures  Effect  Environment  Cost
SVM        0.908     0.7648      0.6243    0.7097  0.6243       0.2336
CNN        0.8956    0.7998      0.8193    0.7062  0.6617       0.5337
CNN+MS     0.8957    0.8288      0.8418    0.7535  0.7339       0.6236
CNN+TL     -         -           -         -       0.6948       0.5518
CNN+MS+TL  -         -           -         -       0.8038       0.6818
Recall of Classification Models for Different Topic Datasets
Algorithm  Attitude  Competence  Measures  Effect  Environment  Cost
SVM        0.9221    0.8190      0.6747    0.7244  0.6195       0.2850
CNN        0.9277    0.8678      0.8322    0.7527  0.7235       0.6319
CNN+MS     0.9289    0.8764      0.8406    0.7970  0.7433       0.6541
CNN+TL     -         -           -         -       0.7560       0.6556
CNN+MS+TL  -         -           -         -       0.7724       0.7124
F1 Value of Classification Models for Different Topic Datasets
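This page does not state how the reported metrics were computed. For reference, the sketch below uses the standard definitions of precision, recall, and F1 for the positive class, computed from true positives, false positives, and false negatives. The function name `precision_recall_f1` and the example counts are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for the positive class.

    tp: true positives, fp: false positives, fn: false negatives.
    A metric whose denominator is zero is reported as 0.0.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 80 correct positives, 20 false alarms, 40 misses.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which makes it a stricter summary than either metric alone on highly imbalanced topics.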
