Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (9): 54-62    DOI: 10.11925/infotech.2096-3467.2020.1105
Identifying Pathogens of Foodborne Diseases with Machine Learning
Wang Hanxue1,2, Cui Wenjuan1, Zhou Yuanchun1,2, Du Yi1,2
1Computer Network Information Center, Chinese Academy of Sciences, Beijing 100089, China
2University of Chinese Academy of Sciences, Beijing 100089, China

[Objective] This paper introduces external data to enhance the word vector representation of exposure foods and uses machine learning to identify the pathogens of foodborne diseases. [Methods] First, we extracted spatial, temporal, patient, and exposure information from foodborne disease case records as features for pathogen identification. Second, we embedded the exposure foods with a word vector representation that integrates domain knowledge. Third, we used an XGBoost model to examine the correlations among features and to identify several important foodborne disease pathogens. [Results] The proposed method yielded more accurate word vector representations of exposure foods than traditional models. It also achieved 68% precision and recall in identifying four important foodborne disease pathogens (Salmonella, Escherichia coli, Vibrio parahaemolyticus, and Norovirus), which can support auxiliary diagnosis and treatment of patients. [Limitations] We analyzed only four major foodborne disease pathogens. [Conclusions] The proposed method could improve the control of foodborne diseases.

Key words: Foodborne Disease; Pathogen Identification; Word Embedding; Machine Learning
Received: 10 November 2020      Published: 24 November 2020
ZTFLH:  TP399  
Fund: National Key Research and Development Plan (2017YFC1601504); National Natural Science Foundation of China (61836013)
Corresponding Authors: Du Yi     E-mail:

Cite this article:

Wang Hanxue, Cui Wenjuan, Zhou Yuanchun, Du Yi. Identifying Pathogens of Foodborne Diseases with Machine Learning. Data Analysis and Knowledge Discovery, 2021, 5(9): 54-62.


Flowchart of Word Vector Representation Based on Domain Knowledge
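Field: Example
Dish name: Yuxiang shredded pork (鱼香肉丝)
Ingredient list: pork tenderloin, winter bamboo shoots, carrot, black wood ear fungus, minced scallion, minced ginger, minced garlic, chopped chili, starch slurry, cooking wine, vinegar, light soy sauce, sugar, sesame oil, water
Method description: Soak the black wood ear fungus in warm water until rehydrated, then rinse it. Cut the pork tenderloin into shreds along the grain and work in a small pinch of salt by hand until sticky. Mix one teaspoon of dry starch with a little water into a slurry, add it to the pork shreds, and work it in by hand until fully absorbed…
Category: Yuxiang shredded pork, home-style dish, dish to accompany rice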
Sample Recipe Data
The Process of Random Walk
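The random walk step named in the caption above can be illustrated with a minimal sketch of a DeepWalk-style uniform random walk. It assumes a graph that links each dish to its ingredients and categories; the graph construction, the function names (`build_graph`, `random_walk`, `generate_walks`), the walk settings, and the toy recipes are all illustrative assumptions, not the authors' implementation.

```python
import random


def build_graph(recipes):
    """Build an undirected graph linking each dish to its ingredients and categories."""
    graph = {}
    for dish, linked_nodes in recipes.items():
        for node in linked_nodes:
            graph.setdefault(dish, set()).add(node)
            graph.setdefault(node, set()).add(dish)
    # Sorted adjacency lists make the walks reproducible for a fixed seed.
    return {node: sorted(adjacent) for node, adjacent in graph.items()}


def random_walk(graph, start, length, rng):
    """Return a uniform random walk of at most `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbours = graph[walk[-1]]
        if not neighbours:
            break
        walk.append(rng.choice(neighbours))
    return walk


def generate_walks(graph, walks_per_node, length, seed=0):
    """Start `walks_per_node` walks from every node, visiting nodes in shuffled order."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(random_walk(graph, node, length, rng))
    return walks


# Hypothetical toy data: dish -> ingredients and categories.
recipes = {
    "yuxiang shredded pork": ["pork tenderloin", "carrot", "wood ear fungus", "home-style dish"],
    "stir-fried pork with carrot": ["pork tenderloin", "carrot", "home-style dish"],
}

graph = build_graph(recipes)
walks = generate_walks(graph, walks_per_node=2, length=6)
for walk in walks[:3]:
    print(" -> ".join(walk))
```

Each walk is a sequence of node names, so a collection of walks can be treated like sentences and passed to a word2vec-style model, which is the usual way DeepWalk turns graph structure into embeddings.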
Parameter Setting
corpus_size 268 GB
window size 5
min_count 10
size 128
negative_sample 0
iteration 5
Word Vector Pre-training Model Parameter Settings
The Distribution of Word Vectors of Some Categories of Exposure Food in Two-dimensional Space
The Number of Out-of-Vocabulary (OOV) Words in the General Corpus and the Domain Knowledge Corpus
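An out-of-vocabulary comparison of this kind amounts to checking each exposure food name against the vocabulary of each embedding model. The sketch below shows one way to count such misses; the vocabularies, the food names, and the helper `split_oov` are hypothetical and stand in for the paper's actual corpora.

```python
def split_oov(food_names, vocabulary):
    """Split food names into those covered by the vocabulary and those that are OOV."""
    in_vocab = [name for name in food_names if name in vocabulary]
    oov = [name for name in food_names if name not in vocabulary]
    return in_vocab, oov


# Hypothetical vocabularies: the domain corpus adds dish and ingredient names.
general_vocab = {"pork", "carrot", "rice"}
domain_vocab = general_vocab | {"yuxiang shredded pork", "wood ear fungus"}

# Hypothetical exposure foods taken from case records.
exposure_foods = ["pork", "yuxiang shredded pork", "wood ear fungus", "raw oysters"]

_, oov_general = split_oov(exposure_foods, general_vocab)
_, oov_domain = split_oov(exposure_foods, domain_vocab)

print("OOV with general corpus:", len(oov_general))
print("OOV with domain corpus:", len(oov_domain))
```

A food name that stays OOV under both vocabularies has no vector at all, so it needs a fallback representation before it can be used as a model feature.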
Precision 68%
Recall 68%
F1-Score 68%
Experimental Results

3710 566 667 266
623 3926 463 415
792 405 1920 364
210 338 326 2444
Confusion Matrix
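The reported precision, recall, and F1-score can be cross-checked against the confusion matrix. The sketch below assumes that rows are true classes and columns are predicted classes, and that the reported figures are support-weighted averages; neither is stated on this page, and the class order is not given, so the classes are referred to by index only. Under these assumptions, the weighted scores should land close to the reported 68%.

```python
# Counts from the confusion matrix above.
# Assumption: rows are true classes, columns are predicted classes.
matrix = [
    [3710, 566, 667, 266],
    [623, 3926, 463, 415],
    [792, 405, 1920, 364],
    [210, 338, 326, 2444],
]


def per_class_metrics(cm):
    """Return (precision, recall, f1, support) for every class index."""
    n = len(cm)
    column_sums = [sum(cm[r][c] for r in range(n)) for c in range(n)]
    results = []
    for k in range(n):
        true_positive = cm[k][k]
        support = sum(cm[k])
        precision = true_positive / column_sums[k] if column_sums[k] else 0.0
        recall = true_positive / support if support else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        results.append((precision, recall, f1, support))
    return results


def weighted_average(metrics):
    """Average precision, recall, and F1 over classes, weighted by class support."""
    total = sum(m[3] for m in metrics)
    return tuple(sum(m[i] * m[3] for m in metrics) / total for i in range(3))


metrics = per_class_metrics(matrix)
precision, recall, f1 = weighted_average(metrics)
accuracy = sum(matrix[k][k] for k in range(len(matrix))) / sum(map(sum, matrix))

for index, (p, r, f, support) in enumerate(metrics):
    print(f"class {index}: precision={p:.3f} recall={r:.3f} f1={f:.3f} support={support}")
print(f"weighted: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"accuracy: {accuracy:.3f}")
```

If the matrix is actually stored with predicted classes in rows, the per-class precision and recall swap, while accuracy is unchanged.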