New Technology of Library and Information Service, 2016, Vol. 32, Issue 9: 78-87. https://doi.org/10.11925/infotech.1003-3513.2016.09.10
Application Paper
Detecting Disease Associations with Word2Vec from Consumer Health Information
Luo Wenxin, Chen Chong, Deng Siyi
School of Government, Beijing Normal University, Beijing 100875, China
Abstract

[Objective] Non-specialists are often unaware of the associations among diseases, which limits the completeness and effectiveness of their health information seeking. This study uses Word2Vec, a deep learning technique, to detect disease associations in consumer-oriented health information. [Methods] Experts selected 30 common disease topics, and documents on each disease were collected from a high-quality medical news website, Medical News Today. Word2Vec was used to build word vectors from the documents related to each disease, and the distances between vectors were computed to judge disease associations. The accuracy of these judgments was measured by their correlation with expert scores. We also compared the effects of different models, optimization methods, data sizes, and key parameters on the results. [Results] In the best case, the correlation coefficient between the Word2Vec results and the expert scores reached 0.635. The Skip-Gram model combined with Negative Sampling using 20 negative samples gave the best results on the large-scale dataset. [Limitations] Broadly defined disease topics reduce the accuracy of the Word2Vec judgments; the granularity of the disease topics selected in this study needs improvement. [Conclusions] Word2Vec can detect disease associations even in consumer-oriented health information sources, and its effectiveness suggests that the technique can be used to improve personalized services for public health information seeking.

Key words: Word2Vec; Disease association; Non-professional medical information; Health information; Personalization
Received: 2016-05-16      Published: 2016-10-19
Cite this article:
Luo Wenxin, Chen Chong, Deng Siyi. Detecting Disease Associations with Word2Vec from Consumer Health Information[J]. New Technology of Library and Information Service, 2016, 32(9): 78-87.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.09.10      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2016/V32/I9/78
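
The evaluation described in the abstract (one vector per disease, vector distance as the association score, correlation against expert ratings) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the disease names, the three-dimensional vectors, and the expert scores are made-up placeholders; cosine similarity stands in for the vector distance, and Pearson's r stands in for the correlation coefficient, neither of which the abstract specifies. In practice the vectors would come from a trained Word2Vec model; in gensim, for instance, a Skip-Gram model with 20 negative samples would correspond to `sg=1` and `negative=20`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical disease vectors (placeholders for vectors learned by Word2Vec).
disease_vectors = {
    "diabetes": [0.9, 0.1, 0.3],
    "obesity": [0.8, 0.2, 0.4],
    "asthma": [0.1, 0.9, 0.2],
}

# Hypothetical expert relatedness scores for each disease pair.
expert_scores = {
    ("diabetes", "obesity"): 4.5,
    ("diabetes", "asthma"): 1.0,
    ("obesity", "asthma"): 1.5,
}

# Score every pair with the model, then compare against the experts.
pairs = sorted(expert_scores)
model_scores = [cosine(disease_vectors[a], disease_vectors[b]) for a, b in pairs]
human_scores = [expert_scores[p] for p in pairs]

r = pearson(model_scores, human_scores)
print(f"correlation between model and expert scores: {r:.3f}")
```

With 30 disease topics, as in the study, the same loop would run over all 435 unordered pairs; a rank-based coefficient such as Spearman's could be substituted if the expert scores are ordinal.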