Identifying Pathogens of Foodborne Diseases with Machine Learning
Wang Hanxue1,2, Cui Wenjuan1, Zhou Yuanchun1,2, Du Yi1,2
1 Computer Network Information Center, Chinese Academy of Sciences, Beijing 100089, China
2 University of Chinese Academy of Sciences, Beijing 100089, China
[Objective] This paper introduces external data to enhance the word vector representation of exposure foods, and then uses machine learning methods to identify the pathogens of foodborne diseases. [Methods] First, we extracted spatial, temporal, patient, and exposure information from foodborne disease cases as features for pathogen identification. Second, we embedded the exposure foods using a word vector representation technique that integrates domain knowledge. Third, we used the XGBoost machine learning model to learn the relationships among these features and to identify several important foodborne disease pathogens. [Results] The proposed method yielded more accurate word vector representations of exposure foods than traditional models. It also achieved 68% precision and recall in identifying four important foodborne disease pathogens: Salmonella, Escherichia coli, Vibrio parahaemolyticus, and Norovirus, which can support the auxiliary diagnosis and treatment of patients. [Limitations] We analyzed only four major foodborne disease pathogens. [Conclusions] The proposed method could improve the control of foodborne diseases.
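The feature construction and the evaluation metric summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vectors in `FOOD_VECTORS`, the case fields `month` and `age`, and the helper names are hypothetical, plain averaging stands in for the domain-knowledge-enhanced embedding, and macro-averaging is an assumption because the abstract does not say how precision and recall were aggregated. The classifier itself (XGBoost in the paper) is omitted so that the example needs only the Python standard library.

```python
# Hypothetical toy embeddings; in practice these would come from a trained
# word vector model enhanced with external domain data.
FOOD_VECTORS = {
    "oyster": [0.9, 0.1],
    "shrimp": [0.8, 0.2],
    "egg": [0.1, 0.9],
}


def embed_foods(foods, vectors, dim=2):
    """Average the vectors of known foods; return a zero vector if none are known."""
    known = [vectors[f] for f in foods if f in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def build_features(case, vectors):
    """Concatenate simple case fields with the averaged exposure-food embedding."""
    base = [float(case["month"]), float(case["age"])]
    return base + embed_foods(case["foods"], vectors)


def macro_precision_recall(y_true, y_pred):
    """Compute macro-averaged precision and recall over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        actual = sum(1 for t in y_true if t == label)
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)


if __name__ == "__main__":
    case = {"month": 7, "age": 30, "foods": ["oyster", "shrimp", "unknown dish"]}
    print(build_features(case, FOOD_VECTORS))

    y_true = ["Salmonella", "Salmonella", "Norovirus", "Norovirus"]
    y_pred = ["Salmonella", "Norovirus", "Norovirus", "Norovirus"]
    print(macro_precision_recall(y_true, y_pred))
```

In a fuller pipeline, the vectors returned by `build_features` would be passed to a gradient-boosted tree classifier, and the predicted pathogen labels would be scored with a function such as `macro_precision_recall`.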