[Objective] Average people usually do not know the complex associations among diseases, which poses negative effects to their health information seeking experience. This study tries to detect the associations among diseases using popular medical information with the help of deep learning technology (Word2Vec), aiming to improve personalized information services. [Methods] First, we identified 30 common disease topics with the help of medical professionals, and then collected related reports from Medical News Today. Second, we built word vector for each document with Word2Vec technology to calculate the semantic similarities among them. Finally, we compared the machine training results with experts’ scores to evaluate the performance of the proposed method. We also investigated the impacts of different models, optimization methods, data sizes and important parameters to the results. [Results] The correlation coefficient between the Word2Vec results and the experts’ scores reached 0.635 in optimal condition. We found that Skip-Gram model with less than 20 negative samples on large scale dataset yielded the best results. [Limitations] The precision of the Word2Vec judgment was affected by the number of disease topics. The granularity of disease topic needed to be improved. [Conclusions] The Word2Vec technology could be used to identify diseases association from consumer health information sources. It could also be used to improve the personalized health information services.
罗文馨,陈翀,邓思艺. 基于Word2Vec及大众健康信息源的疾病关联探测[J]. 现代图书情报技术, 2016, 32(9): 78-87.
Luo Wenxin,Chen Chong,Deng Siyi. Detecting Disease Associations with Word2Vec from Consumer Health Information. New Technology of Library and Information Service, 2016, 32(9): 78-87.
(Hou Xiaoni, Sun Jing.Research on Internet Health Information Searching Behaviors of Outpatients from Tertiary Referral Hospital in Beijing[J]. Library and Information Service, 2015, 59(20): 126-131, 11.)
[4]
Klavans J L, Muresan S.Evaluation of the DEFINDER System for Fully Automatic Glossary Construction[C]. In: Proceedings AMIA Annual Symposium. 2001: 324-328.
[5]
Zeng-Treitler Q, Tse T.Exploring and Developing Consumer Health Vocabularies[J]. Journal of the American Medical Informatics Association, 2006, 13(1): 24-29.
[6]
Zeng-Treitler Q, Goryachev S, Tse T, et al.Estimating Consumer Familiarity with Health Terminology: A Context- based Approach[J]. Journal of the American Medical Informatics Association, 2008, 15(3): 349-356.
[7]
Burgun A, Bodenreider O.Mapping the UMLS Semantic Network into General Ontologies [C]. In: Proceedings of Annual Symposium. 2001: 81-85.
[8]
Keselman A, Smith C A, Divita G, et al.Consumer Health Concepts that do not Map to the UMLS: Where do They Fit?[J]. Journal of the American Medical Informatics Association, 2008, 15(4): 496-505.
[9]
Yang Z H, Lin H F, Li Y P, et al.TREC 2005 Genomics Track Experiments at DUTAI [C]. In: Proceedings of the 14th Text REtrieval Conference. 2005: 1-9.
[10]
Yang Z H, Lin H F, Li Y P, et al.DUTIR at TREC 2006 Genomics and Enterprise Tracks [C]. In: Proceedings of the 15th Text REtrieval Conference. 2006: 1-10.
[11]
Jiang Q, Wang Y, Hao Y, et al.miR2Disease: A Manually Curated Database for microRNA Deregulation in Human Disease[J]. Nucleic Acids Research, 2009, 37(Database issue): D98-104.
[12]
Yang H, Yang C C. Using Health Consumer Contributed Data to Detect Adverse Drug Reactions by Association Mining with Temporal Analysis [J]. ACM Transactions on Intelligent Systems & Technology, 2015, 6(4): Article No.55.
[13]
Chen A T.Exploring Online Support Spaces: Using Cluster Analysis to Examine Breast Cancer, Diabetes and Fibromyalgia Support Groups[J]. Patient Education and Counseling, 2012, 87(2): 250-257.
(Liu Hongxia, Zhang Jin, Chen Jinghao.Social Network Analysis of Semantic Links Relationships Among Health Topics in WHO English Website[J]. Library and Information Service, 2014, 58(13): 75-82.)
[15]
Bengio Y, Schwenk H, Senécal J-S, et al.A Neural Probabilistic Language Model[J]. Journal of Machine Learning Research, 2003, 3(6): 1137-1155.
[16]
Mikolov T, Chen K, Corrado G, et al. Efficient Estimation of Word Representations in Vector Space [OL]. [2016-05-13]. .
[17]
Mikolov T, Sutskever I, Chen K, et al.Distributed Representations of Words and Phrases and Their Compositionality [A]. //Advances in Neural Information Processing Systems[M]. 2013: 3111-3119.
[18]
Handler A.An Empirical Study of Semantic Similarity in WordNet and Word2Vec [D]. Columbia University, 2014.
[19]
Amunategui M, Markwell T, Rozenfeld Y.Prediction Using Note Text: Synthetic Feature Creation with Word2Vec[J]. Computer Science, 2015(3): 1-6.
[20]
Ju R, Zhou P, Li C H, et al.An Efficient Method for Document Categorization Based on Word2Vec and Latent Semantic Analysis [C]. In: Proceedings of the 2015 IEEE International Conference on Computer and Information Technology; Ubiquitous Computing and Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and Computing (CIT/IUCC/DASC/PICOM). IEEE, 2015: 2276-2283.
[21]
Su Z, Xu H, Zhang D, et al.Chinese Sentiment Classification Using a Neural Network Tool — Word2Vec [C]. In: Proceedings of the 2014 International Conference on Multisensor Fusion and Information Integration for Intelligent Systems (MFI). IEEE, 2014: 1-6.