Identifying Pathogens of Foodborne Diseases with Machine Learning
Wang Hanxue1,2, Cui Wenjuan1, Zhou Yuanchun1,2, Du Yi1,2
1Computer Network Information Center, Chinese Academy of Sciences, Beijing 100089, China
2University of Chinese Academy of Sciences, Beijing 100089, China
[Objective] This paper introduces external data to enhance the word vector representation of exposure foods, and then uses machine learning methods to identify foodborne disease pathogens. [Methods] First, we extracted spatial, temporal, patient, and exposure information from foodborne disease cases as features for identifying pathogens. Second, we embedded the exposure foods using a word vector representation that integrates domain knowledge. Third, we used the XGBoost machine learning model to examine the correlations among features and to identify several important foodborne disease pathogens. [Results] The proposed method yielded more accurate word vector representations of exposure foods than traditional models. It also achieved 68% precision and recall in identifying four important foodborne disease pathogens: Salmonella, Escherichia coli, Vibrio parahaemolyticus, and Norovirus, which can support auxiliary diagnosis and treatment of patients. [Limitations] We only analyzed four major foodborne disease pathogens. [Conclusions] The proposed method could improve the control of foodborne diseases.
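
As a rough illustration of how the case features described in the abstract could be assembled for a classifier, the following sketch averages the word vectors of a case's exposure foods and concatenates them with simple structured features. The field names (`province_id`, `month`, `age`), the toy three-dimensional vectors, and the choice of averaging are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import Dict, List

# Toy 3-dimensional food vectors (invented); real embeddings would come
# from a trained word vector model.
FOOD_VECTORS: Dict[str, List[float]] = {
    "pork": [0.2, 0.1, 0.7],
    "shrimp": [0.9, 0.3, 0.1],
}
DIM = 3


def embed_foods(foods: List[str]) -> List[float]:
    """Average the vectors of known foods; out-of-vocabulary foods are skipped."""
    known = [FOOD_VECTORS[f] for f in foods if f in FOOD_VECTORS]
    if not known:
        return [0.0] * DIM
    return [sum(column) / len(known) for column in zip(*known)]


def case_to_features(case: dict) -> List[float]:
    """Concatenate structured fields (hypothetical names) with the food embedding."""
    structured = [
        float(case["province_id"]),  # spatial feature (assumed encoding)
        float(case["month"]),        # temporal feature (assumed encoding)
        float(case["age"]),          # patient feature (assumed encoding)
    ]
    return structured + embed_foods(case["foods"])


example = {"province_id": 11, "month": 7, "age": 34,
           "foods": ["pork", "shrimp", "unknown_food"]}
features = case_to_features(example)
print(features)
```

A feature matrix built this way could then be passed to a gradient-boosted tree classifier such as XGBoost, the model named in the abstract.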

Key words: Foodborne Disease; Pathogen Identification; Word Embedding; Machine Learning
Received: 10 November 2020      Published: 24 November 2020
ZTFLH:  TP399  
Fund: National Key Research and Development Plan (2017YFC1601504); National Natural Science Foundation of China (61836013)
Wang Hanxue, Cui Wenjuan, Zhou Yuanchun, Du Yi. Identifying Pathogens of Foodborne Diseases with Machine Learning. Data Analysis and Knowledge Discovery, 2021, 5(9): 54-62.

Flowchart of Word Vector Representation Based on Domain Knowledge
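Field | Example
Dish name | Yuxiang shredded pork (鱼香肉丝)
Ingredient list | pork tenderloin, winter bamboo shoots, carrot, black wood ear fungus, minced scallion, minced ginger, minced garlic, chopped chili, starch slurry, cooking wine, vinegar, light soy sauce, white sugar, sesame oil, water
Method description | Soak the black wood ear fungus in warm water until rehydrated, then wash it. Cut the pork tenderloin into shreds along the grain and first rub in a small pinch of salt until sticky. Mix one teaspoon of dry starch with a little water into a slurry, add it to the shredded pork, and work it in by hand until fully absorbed…
Category | Yuxiang shredded pork, home-style dishes, dishes that go well with rice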
Sample Recipe Data
The Process of Random Walk
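
The source gives only the caption for this figure. As a generic illustration, the sketch below shows how random walks over a graph linking dishes to their ingredients can produce token sequences suitable for word vector training. The toy graph, the walk length, and the uniform choice of the next neighbor are assumptions, not details taken from the paper.

```python
import random
from typing import Dict, List

# Toy undirected dish-ingredient graph (hypothetical data).
GRAPH: Dict[str, List[str]] = {
    "yuxiang_pork": ["pork", "carrot", "wood_ear"],
    "carrot_soup": ["carrot", "ginger"],
    "pork": ["yuxiang_pork"],
    "carrot": ["yuxiang_pork", "carrot_soup"],
    "wood_ear": ["yuxiang_pork"],
    "ginger": ["carrot_soup"],
}


def random_walk(graph: Dict[str, List[str]], start: str,
                length: int, rng: random.Random) -> List[str]:
    """Walk `length` nodes from `start`, picking each next neighbor uniformly."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk


rng = random.Random(0)  # fixed seed so the walks are reproducible
walks = [random_walk(GRAPH, node, 5, rng) for node in GRAPH]
for walk in walks:
    print(" ".join(walk))
```

Each printed walk can be treated as one "sentence" in a training corpus, so foods that share dishes tend to appear in the same context windows.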
Parameter | Setting
corpus_size | 268 GB
window size | 5
min_count | 10
size | 128
negative_sample | 0
iteration | 5
Word Vector Pre-training Model Parameter Settings
The Distribution of Word Vectors of Some Categories of Exposure Food in Two-dimensional Space
The Number of Word Vectors OOV Between General Corpus and Domain Knowledge Corpus
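
To make the out-of-vocabulary (OOV) comparison concrete, the sketch below counts how many exposure-food terms are missing from each of two vocabularies. The term list and both vocabularies are invented toy data and do not reproduce the figure's numbers; only the counting logic is the point.

```python
from typing import Iterable, Set


def count_oov(terms: Iterable[str], vocabulary: Set[str]) -> int:
    """Count distinct terms that have no entry in the vocabulary."""
    return sum(1 for term in set(terms) if term not in vocabulary)


# Hypothetical exposure-food terms and vocabularies.
exposure_foods = ["pork", "wood_ear", "chopped_chili", "shrimp"]
general_vocab = {"pork", "shrimp"}
domain_vocab = {"pork", "shrimp", "wood_ear", "chopped_chili"}

oov_general = count_oov(exposure_foods, general_vocab)
oov_domain = count_oov(exposure_foods, domain_vocab)
print(oov_general, oov_domain)
```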
Precision 68%
Recall 68%
F1-Score 68%
Experimental Results

3710  566  667  266
 623 3926  463  415
 792  405 1920  364
 210  338  326 2444
Confusion Matrix
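
The summary metrics can be recomputed from the confusion matrix. The sketch below assumes that rows are true classes and columns are predicted classes, which the source does not state; under the opposite orientation, per-class precision and recall simply swap. It also reads the diagonal entries as the four-digit counts 3710, 3926, 1920, and 2444.

```python
# Confusion matrix counts; the row/column orientation is an assumption.
CONFUSION = [
    [3710, 566, 667, 266],
    [623, 3926, 463, 415],
    [792, 405, 1920, 364],
    [210, 338, 326, 2444],
]

n = len(CONFUSION)
total = sum(sum(row) for row in CONFUSION)
correct = sum(CONFUSION[i][i] for i in range(n))
accuracy = correct / total

# Recall: diagonal over row sum. Precision: diagonal over column sum.
recalls = [CONFUSION[i][i] / sum(CONFUSION[i]) for i in range(n)]
precisions = [
    CONFUSION[i][i] / sum(CONFUSION[r][i] for r in range(n)) for i in range(n)
]
f1_scores = [2 * p * r / (p + r) for p, r in zip(precisions, recalls)]

macro_precision = sum(precisions) / n
macro_recall = sum(recalls) / n
macro_f1 = sum(f1_scores) / n

print(f"accuracy={accuracy:.3f}")
print(f"macro precision={macro_precision:.3f}")
print(f"macro recall={macro_recall:.3f}")
print(f"macro F1={macro_f1:.3f}")
```

With these counts, accuracy and the macro-averaged precision, recall, and F1 all fall between roughly 0.68 and 0.69, in line with the reported 68%.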
