New Technology of Library and Information Service  2016, Vol. 32 Issue (9): 78-87    DOI: 10.11925/infotech.1003-3513.2016.09.10
Application Paper
Detecting Disease Associations with Word2Vec from Consumer Health Information
Luo Wenxin, Chen Chong, Deng Siyi
School of Government, Beijing Normal University, Beijing 100875, China

Abstract

[Objective] Non-medical readers usually do not know the associations among diseases, which limits the completeness and effectiveness of their health information seeking. This study uses Word2Vec, a deep learning technique, to detect disease associations from consumer-oriented health information. [Methods] Experts selected 30 common disease topics, and we collected the corresponding documents from a high-quality medical news website, Medical News Today. We trained Word2Vec word vectors on the documents for each disease and computed vector distances to judge disease associations. We evaluated the accuracy of these judgments by correlation analysis against expert scores, and compared the effects of different models, optimization methods, data sizes and key parameters on the results. [Results] In the best case, the correlation coefficient between the Word2Vec results and the expert scores reached 0.635. The Skip-Gram model combined with Negative Sampling using 20 negative samples yielded the best results on the large-scale dataset. [Limitations] When disease topics are broadly defined, the accuracy of the Word2Vec judgments suffers; the granularity of the disease topics in this study needs improvement. [Conclusions] Word2Vec can detect disease associations even in consumer-oriented health information sources, and its effectiveness suggests that it can be used to improve personalized services for public health information seeking.
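
The evaluation described in the abstract (vector distances between diseases, then correlation against expert scores) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which distance measure or which correlation coefficient was used, so cosine similarity and Pearson's r are assumptions here, and the three-dimensional vectors and expert scores are made-up placeholders standing in for vectors trained with Skip-Gram and Negative Sampling.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical disease vectors; in the study these would come from a
# trained Word2Vec model, not from hand-written numbers.
disease_vectors = {
    "diabetes": [0.9, 0.1, 0.3],
    "obesity": [0.8, 0.2, 0.4],
    "asthma": [0.1, 0.9, 0.2],
}

# Hypothetical expert relatedness scores for each disease pair
# (the rating scale is an assumption).
expert_scores = {
    ("asthma", "diabetes"): 1.0,
    ("asthma", "obesity"): 2.0,
    ("diabetes", "obesity"): 5.0,
}

# Score every pair with the model, in the same order as the expert scores.
pairs = sorted(expert_scores)
model_scores = [
    cosine_similarity(disease_vectors[a], disease_vectors[b]) for a, b in pairs
]
human_scores = [expert_scores[p] for p in pairs]

# Agreement between the model's judgments and the experts' judgments.
r = pearson(model_scores, human_scores)
print(f"correlation with expert scores: {r:.3f}")
```

With 30 disease topics there would be 435 disease pairs rather than three, and the reported 0.635 would correspond to `r` computed over the scored pairs under whichever coefficient the authors chose.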

Key words: Word2Vec; Disease association; Non-professional medical information; Health information; Personalization
Received: 2016-05-16
Cite this article:
Luo Wenxin, Chen Chong, Deng Siyi. Detecting Disease Associations with Word2Vec from Consumer Health Information[J]. New Technology of Library and Information Service, 2016, 32(9): 78-87. DOI: 10.11925/infotech.1003-3513.2016.09.10.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.09.10