Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (7): 66-75     https://doi.org/10.11925/infotech.2096-3467.2019.1299
Research Paper
Classification of Health Questions Based on Vector Extension of Keywords *
Tang Xiaobo 1,2, Gao Hexuan 1
1 School of Information Management, Wuhan University, Wuhan 430072, China
2 Center for Studies of Information Systems, Wuhan University, Wuhan 430072, China
Abstract

[Objective] Using health questions collected from medical question-answering communities, this paper proposes a health question classification model based on keyword word-vector feature extension. The aim is to improve the efficiency of health question classification and thereby help such communities raise patient satisfaction. [Methods] Keywords are extracted with TF-IDF and with LDA separately, the word-vector features of the extracted keywords are extended with Word2Vec, and the extended features are applied to the classification of health questions from a medical question-answering community. [Results] The model improves health question classification. The best results were obtained with TF-IDF keyword extraction, word vectors trained on the complete question-and-answer corpus, 600 words retained in the dictionary, and the CBOW language model: precision (P), recall (R) and F-value were 0.9872, 0.9725 and 0.9798, respectively. [Limitations] Keywords of short medical texts were not extracted at a deep semantic level. [Conclusions] The proposed model performs better on health question classification than existing classification methods.

Key words: Feature Expansion; Classification of Short Texts; Word2Vec; TF-IDF
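The pipeline summarized in the Methods (TF-IDF keyword extraction followed by word-vector feature extension) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the questions are already tokenized (Chinese text would first need word segmentation), it uses a tiny hand-made `vectors` dictionary in place of a trained Word2Vec model, and the function names `tfidf_keywords` and `extend_keywords`, the English tokens, and all toy values are hypothetical.

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=2):
    """Return the top_k highest TF-IDF tokens for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    result = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            tok: (cnt / len(doc)) * math.log(n_docs / df[tok])
            for tok, cnt in tf.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:top_k])
    return result

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def extend_keywords(keywords, vectors, n_neighbors=1):
    """Append the nearest neighbours (by cosine similarity) of each keyword."""
    extended = list(keywords)
    for kw in keywords:
        if kw not in vectors:
            continue  # no vector available for this keyword
        candidates = [w for w in vectors if w != kw and w not in extended]
        candidates.sort(key=lambda w: -cosine(vectors[kw], vectors[w]))
        extended.extend(candidates[:n_neighbors])
    return extended

# Toy, pre-tokenized questions (assumed data, not from the paper).
docs = [
    ["eczema", "itch", "eczema", "skin", "itch", "cream"],
    ["hair", "loss", "hair", "scalp", "skin"],
    ["vitiligo", "white", "patch", "skin", "vitiligo"],
]

# Toy 3-dimensional "word vectors" standing in for a trained Word2Vec model.
vectors = {
    "eczema": [0.9, 0.1, 0.0],
    "dermatitis": [0.85, 0.15, 0.0],
    "itch": [0.6, 0.4, 0.0],
    "pruritus": [0.55, 0.45, 0.05],
    "hair": [0.0, 0.2, 0.9],
    "scalp": [0.1, 0.3, 0.8],
}

keywords = tfidf_keywords(docs)[0]
print(keywords)
print(extend_keywords(keywords, vectors))
```

In this toy setting the token "skin" occurs in every question, so its inverse document frequency is zero and it is never selected as a keyword; the extension step then adds the closest neighbour of each selected keyword. How the extended features are fed to the classifier is outside the scope of this sketch.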
Received: 2019-12-04      Published: 2020-07-25
CLC number (ZTFLH): TP391
Funding: *This work is one of the research outcomes of the National Natural Science Foundation of China project "Research on Intelligent Information Services Based on Text and Web Semantic Analysis" (71673209).
Corresponding author: Gao Hexuan     E-mail: gaohexuan@whu.edu.com
Cite this article:
Tang Xiaobo, Gao Hexuan. Classification of Health Questions Based on Vector Extension of Keywords[J]. Data Analysis and Knowledge Discovery, 2020, 4(7): 66-75.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.1299      or      http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I7/66
Fig.1  Health question classification model based on keyword word-vector feature extension
Fig.2  Structure of LDA
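| No. | Category | Number of questions |
| --- | --- | --- |
| 1 | Psoriasis (牛皮癣) | 14,127 |
| 2 | Vitiligo (白癜风) | 20,984 |
| 3 | Urticaria (荨麻疹) | 2,362 |
| 4 | Ichthyosis (鱼鳞病) | 696 |
| 5 | Hair loss (脱发) | 2,515 |
| 6 | Eczema (湿疹) | 2,122 |
| 7 | Underarm odor (腋臭) | 2,492 |
| 8 | Herpes zoster (带状疱疹) | 1,000 |
| 9 | Rosacea (酒渣鼻) | 153 |
| 10 | Onychomycosis (灰指甲) | 686 |
| 11 | Tinea versicolor (花斑癣) | 221 |
| 12 | Axillary hyperhidrosis (腋下多汗) | 360 |
| 13 | Psoriasis (银屑病) | 1,758 |
| 14 | Tinea capitis (头癣) | 340 |
| 15 | Body odor (狐臭) | 523 |

Table 1  Categories of health questions and number of questions in each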
Fig.3  Perplexity versus number of topics
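| Corpus | CBOW P | CBOW R | CBOW F | skip-gram P | skip-gram R | skip-gram F |
| --- | --- | --- | --- | --- | --- | --- |
| Chinese Wikipedia corpus | 0.9454 | 0.9124 | 0.9394 | 0.9485 | 0.9149 | 0.9314 |
| Health question corpus | 0.9535 | 0.9212 | 0.9445 | 0.9563 | 0.9234 | 0.9396 |
| Doctors' answers to health questions corpus | 0.9504 | 0.9188 | 0.9507 | 0.9533 | 0.9202 | 0.9365 |
| Complete question-and-answer corpus | 0.9567 | 0.9254 | 0.9504 | 0.9596 | 0.9275 | 0.9433 |

Table 2  Comparison of classification results across corpora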
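Table 2 compares the two Word2Vec language models, CBOW and skip-gram. As a reminder of how they differ, the sketch below builds the training examples each one would see for a tokenized sentence: CBOW predicts a target word from its surrounding context, while skip-gram predicts each context word from the target. The function names, the `window` parameter, and the toy sentence are assumptions made for illustration; no actual model is trained here.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

def cbow_examples(tokens, window=2):
    """CBOW: one example per position, (context words) -> target word."""
    return [(ctx, tgt) for tgt, ctx in context_windows(tokens, window)]

def skipgram_examples(tokens, window=2):
    """Skip-gram: one example per (target word, context word) pair."""
    return [
        (tgt, c)
        for tgt, ctx in context_windows(tokens, window)
        for c in ctx
    ]

sentence = ["eczema", "itch", "skin"]  # toy tokens, not from the paper
print(cbow_examples(sentence, window=1))
print(skipgram_examples(sentence, window=1))
```

With the same window, skip-gram produces more (and smaller) training examples per sentence than CBOW, which is one reason the two models can behave differently on the same corpus.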
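| Words retained in dictionary | CBOW P | CBOW R | CBOW F | skip-gram P | skip-gram R | skip-gram F |
| --- | --- | --- | --- | --- | --- | --- |
| 300 | 0.9856 | 0.9669 | 0.9761 | 0.9817 | 0.9516 | 0.9664 |
| 600 | 0.9872 | 0.9725 | 0.9798 | 0.9832 | 0.9520 | 0.9673 |
| 1,200 | 0.9856 | 0.9701 | 0.9778 | 0.9808 | 0.9505 | 0.9654 |
| 1,800 | 0.9847 | 0.9666 | 0.9756 | 0.9804 | 0.9486 | 0.9642 |
| 2,400 | 0.9814 | 0.9644 | 0.9728 | 0.9788 | 0.9415 | 0.9598 |

Table 3  Comparison of classification results after health question extension (TF-IDF)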
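| Words retained in dictionary | CBOW P | CBOW R | CBOW F | skip-gram P | skip-gram R | skip-gram F |
| --- | --- | --- | --- | --- | --- | --- |
| 300 | 0.9543 | 0.9250 | 0.9394 | 0.9314 | 0.9214 | 0.9264 |
| 600 | 0.9572 | 0.9321 | 0.9445 | 0.9547 | 0.9288 | 0.9416 |
| 1,200 | 0.9615 | 0.9401 | 0.9507 | 0.9588 | 0.9367 | 0.9476 |
| 1,800 | 0.9659 | 0.9416 | 0.9536 | 0.9633 | 0.9383 | 0.9506 |
| 2,400 | 0.9621 | 0.9389 | 0.9504 | 0.9601 | 0.9372 | 0.9485 |

Table 4  Comparison of classification results after health question extension (LDA)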
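| Model | P | R | F |
| --- | --- | --- | --- |
| SVM | 0.9457 | 0.9391 | 0.9424 |
| CNN without extension | 0.9596 | 0.9275 | 0.9433 |
| Word-vector feature extension after LDA keyword extraction | 0.9659 | 0.9416 | 0.9536 |
| Word-vector feature extension after TF-IDF keyword extraction | 0.9872 | 0.9725 | 0.9798 |

Table 5  Comparison of classification results across models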
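The F values in Table 5 are consistent with the usual balanced F-measure, the harmonic mean of precision and recall, F = 2PR / (P + R). The page does not state the formula, so treating F as F1 is an assumption; the short check below recomputes it from the reported P and R. The helper name `f_measure` and the English row labels are introduced here for illustration.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# (P, R, reported F) for each model, as listed in Table 5.
table5 = {
    "SVM": (0.9457, 0.9391, 0.9424),
    "CNN without extension": (0.9596, 0.9275, 0.9433),
    "LDA keywords + extension": (0.9659, 0.9416, 0.9536),
    "TF-IDF keywords + extension": (0.9872, 0.9725, 0.9798),
}

for model, (p, r, reported_f) in table5.items():
    # Print the recomputed value next to the reported one for comparison.
    print(f"{model}: computed F = {f_measure(p, r):.4f}, reported F = {reported_f:.4f}")
```

Because the harmonic mean is pulled toward the smaller of the two inputs, a model must raise both precision and recall to raise F; the TF-IDF variant improves both relative to the other rows.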