Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (7): 66-75    DOI: 10.11925/infotech.2096-3467.2019.1299
Classification of Health Questions Based on Vector Extension of Keywords
Tang Xiaobo1,2, Gao Hexuan1
1School of Information Management, Wuhan University, Wuhan 430072, China
2Center for Studies of Information Systems, Wuhan University, Wuhan 430072, China
Abstract  

[Objective] This paper proposes a classification model for health questions based on keyword vector expansion, aiming to improve the user experience of medical question-answering communities. [Methods] First, we extracted keywords from the questions using TF-IDF and LDA models. Then, we extended the word vector features with Word2Vec and applied them to the classification of health questions. [Results] The proposed method performed best with TF-IDF as the keyword extraction method, the complete question-and-answer set as the training corpus, 600 words retained in the dictionary, and CBOW as the language model. The precision (P), recall (R), and F-measure (F) of the optimal model were 0.9872, 0.9725, and 0.9798, respectively. [Limitations] We did not extract keywords from short medical texts at a deeper semantic level. [Conclusions] The proposed classification model performs better than the existing ones.
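As a rough illustration of the two steps named in the abstract (keyword extraction, then word-vector feature expansion), the sketch below ranks tokens by TF-IDF and appends each keyword's nearest neighbours by cosine similarity. It is a toy under stated assumptions, not the authors' implementation: the function names `tfidf_keywords` and `expand`, the `min_sim` threshold, the English tokens, and the hand-written two-dimensional vectors are all hypothetical stand-ins for a segmented Chinese corpus and trained Word2Vec embeddings.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=2):
    """Return the top_k tokens of each tokenized document, ranked by TF-IDF."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    result = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            tok: (count / len(doc)) * math.log(n_docs / df[tok])
            for tok, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:top_k])
    return result


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand(tokens, keywords, vectors, n_neighbors=1, min_sim=0.9):
    """Append to tokens the nearest new neighbours of each keyword.

    min_sim is a hypothetical cut-off that keeps unrelated words out.
    """
    expanded = list(tokens)
    for kw in keywords:
        if kw not in vectors:
            continue
        candidates = [
            w for w in vectors
            if w not in expanded and cosine(vectors[kw], vectors[w]) >= min_sim
        ]
        candidates.sort(key=lambda w: (-cosine(vectors[kw], vectors[w]), w))
        expanded.extend(candidates[:n_neighbors])
    return expanded


# Toy questions (already tokenized) and toy embeddings; both are made up.
docs = [
    ["skin", "itchy", "red", "rash"],
    ["hair", "loss", "scalp", "itchy"],
    ["skin", "white", "patch"],
]
vectors = {
    "rash": (1.0, 0.1),
    "red": (0.9, 0.2),
    "eczema": (1.0, 0.15),
    "hair": (0.1, 1.0),
    "alopecia": (0.12, 1.0),
}

keywords = tfidf_keywords(docs)
expanded = expand(docs[0], keywords[0], vectors)
```

Here `expanded` is the first question with the extra token `eczema` appended; in a full pipeline the expanded token sequence, rather than the original short question, would be passed to the classifier.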

Key words: Feature Expansion; Classification of Short Texts; Word2Vec; TF-IDF
Received: 04 December 2019      Published: 25 July 2020
Chinese Library Classification number (ZTFLH): TP391
Corresponding Authors: Gao Hexuan     E-mail: gaohexuan@whu.edu.com

Cite this article:

Tang Xiaobo,Gao Hexuan. Classification of Health Questions Based on Vector Extension of Keywords. Data Analysis and Knowledge Discovery, 2020, 4(7): 66-75.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.1299     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I7/66

Figure: The Health Questions Classification Model Based on the Vector Feature Extension of Keywords
Figure: The Structure of LDA
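| No. | Category | Count |
|---|---|---|
| 1 | Psoriasis, colloquial term (牛皮癣) | 14,127 |
| 2 | Vitiligo (白癜风) | 20,984 |
| 3 | Urticaria (荨麻疹) | 2,362 |
| 4 | Ichthyosis (鱼鳞病) | 696 |
| 5 | Hair loss (脱发) | 2,515 |
| 6 | Eczema (湿疹) | 2,122 |
| 7 | Underarm odor (腋臭) | 2,492 |
| 8 | Herpes zoster (带状疱疹) | 1,000 |
| 9 | Rosacea (酒渣鼻) | 153 |
| 10 | Onychomycosis (灰指甲) | 686 |
| 11 | Tinea versicolor (花斑癣) | 221 |
| 12 | Axillary hyperhidrosis (腋下多汗) | 360 |
| 13 | Psoriasis, clinical term (银屑病) | 1,758 |
| 14 | Tinea capitis (头癣) | 340 |
| 15 | Body odor (狐臭) | 523 |

Table: Classification and Quantity Distribution of Health Questions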
Figure: Perplexity and Topic Numbers Curve
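| Corpus | CBOW P | CBOW R | CBOW F | Skip-gram P | Skip-gram R | Skip-gram F |
|---|---|---|---|---|---|---|
| Chinese Wikipedia corpus | 0.9454 | 0.9124 | 0.9394 | 0.9485 | 0.9149 | 0.9314 |
| Health question corpus | 0.9535 | 0.9212 | 0.9445 | 0.9563 | 0.9234 | 0.9396 |
| Doctors' answers to health questions corpus | 0.9504 | 0.9188 | 0.9507 | 0.9533 | 0.9202 | 0.9365 |
| Complete question-and-answer corpus | 0.9567 | 0.9254 | 0.9504 | 0.9596 | 0.9275 | 0.9433 |

Table: Classification Effects of Different Corpora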
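The tables on this page compare the two Word2Vec training modes, CBOW and skip-gram. As a reminder of how they differ, the sketch below builds the training examples each mode would draw from one tokenized sentence: CBOW predicts a word from its surrounding context, while skip-gram predicts each context word from the centre word. The function names, the window size, and the sample sentence are illustrative assumptions, not details taken from the paper.

```python
def cbow_examples(tokens, window=2):
    """Yield (context_words, target): CBOW predicts the target from its context."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            yield context, target


def skipgram_examples(tokens, window=2):
    """Yield (target, context_word): skip-gram predicts each context word."""
    for context, target in cbow_examples(tokens, window):
        for word in context:
            yield target, word


# A made-up, already tokenized question.
sentence = ["scalp", "itchy", "hair", "loss"]
cbow = list(cbow_examples(sentence, window=1))
skipgram = list(skipgram_examples(sentence, window=1))
```

With a window of 1, `cbow` holds one example per word (four in total), whereas `skipgram` holds one example per word-context pair (six in total).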
| Words retained in dictionary | CBOW P | CBOW R | CBOW F | Skip-gram P | Skip-gram R | Skip-gram F |
|---|---|---|---|---|---|---|
| 300 | 0.9856 | 0.9669 | 0.9761 | 0.9817 | 0.9516 | 0.9664 |
| 600 | 0.9872 | 0.9725 | 0.9798 | 0.9832 | 0.9520 | 0.9673 |
| 1,200 | 0.9856 | 0.9701 | 0.9778 | 0.9808 | 0.9505 | 0.9654 |
| 1,800 | 0.9847 | 0.9666 | 0.9756 | 0.9804 | 0.9486 | 0.9642 |
| 2,400 | 0.9814 | 0.9644 | 0.9728 | 0.9788 | 0.9415 | 0.9598 |

Table: Classification Effects after Extending Health Questions (TF-IDF)
| Words retained in dictionary | CBOW P | CBOW R | CBOW F | Skip-gram P | Skip-gram R | Skip-gram F |
|---|---|---|---|---|---|---|
| 300 | 0.9543 | 0.9250 | 0.9394 | 0.9314 | 0.9214 | 0.9264 |
| 600 | 0.9572 | 0.9321 | 0.9445 | 0.9547 | 0.9288 | 0.9416 |
| 1,200 | 0.9615 | 0.9401 | 0.9507 | 0.9588 | 0.9367 | 0.9476 |
| 1,800 | 0.9659 | 0.9416 | 0.9536 | 0.9633 | 0.9383 | 0.9506 |
| 2,400 | 0.9621 | 0.9389 | 0.9504 | 0.9601 | 0.9372 | 0.9485 |

Table: Classification Effects after Extending Health Questions (LDA)
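| Model | P | R | F |
|---|---|---|---|
| SVM | 0.9457 | 0.9391 | 0.9424 |
| CNN without extension | 0.9596 | 0.9275 | 0.9433 |
| Word-vector feature extension after LDA keyword extraction | 0.9659 | 0.9416 | 0.9536 |
| Word-vector feature extension after TF-IDF keyword extraction | 0.9872 | 0.9725 | 0.9798 |

Table: Classification Effects of Different Models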
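The page does not define F. Assuming it is the balanced F-measure, that is, the harmonic mean of P and R, the F column of the model comparison can be recomputed from the reported P and R. The names `f_measure` and `reported` below are illustrative, and the English row labels are translations of the table's model names.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# (P, R, F) as reported in the model comparison table.
reported = {
    "SVM": (0.9457, 0.9391, 0.9424),
    "CNN without extension": (0.9596, 0.9275, 0.9433),
    "LDA keywords + extension": (0.9659, 0.9416, 0.9536),
    "TF-IDF keywords + extension": (0.9872, 0.9725, 0.9798),
}

# Recompute F from P and R, rounded to the table's four decimal places.
recomputed = {
    name: round(f_measure(p, r), 4) for name, (p, r, _) in reported.items()
}
```

Under that assumption, each recomputed value rounds to the F reported for the same row of the model comparison.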