Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (9): 54-62     https://doi.org/10.11925/infotech.2096-3467.2020.1105
Research Paper
Identifying Pathogens of Foodborne Diseases with Machine Learning*
Wang Hanxue1,2, Cui Wenjuan1, Zhou Yuanchun1,2, Du Yi1,2
1Computer Network Information Center, Chinese Academy of Sciences, Beijing 100089, China
2University of Chinese Academy of Sciences, Beijing 100089, China
Abstract

[Objective] This paper introduces external food-domain data to enhance the word vector representation of exposure foods, and uses machine learning to identify the pathogens of foodborne diseases. [Methods] First, we extracted spatial, temporal, patient, and exposure-food information from foodborne disease case data as features for pathogen identification. Then, we represented the exposure foods with word vectors that integrate domain knowledge. Finally, we used an XGBoost model to learn the correlations among these features and identify several important foodborne disease pathogens. [Results] The word vector representation that integrates domain data was more accurate for exposure foods than word vector models trained on a general corpus. The method reached 68% precision and recall in identifying four important foodborne disease pathogens: Salmonella, Norovirus, diarrheagenic Escherichia coli, and Vibrio parahaemolyticus, which can support the auxiliary diagnosis and treatment of foodborne diseases. [Limitations] We only analyzed four major foodborne disease pathogens. [Conclusions] The analysis results can guide the management of foodborne diseases and the design of response plans, and pathogen identification based on these results and machine learning can provide useful support for clinical auxiliary diagnosis and treatment.

Key words: Foodborne Disease    Pathogen Identification    Word Embedding    Machine Learning
Received: 2020-11-10      Published: 2020-11-24
ZTFLH (Chinese Library Classification number): TP399
Funding: *This work is one of the research outcomes of the National Key Research and Development Program of China (2017YFC1601504) and the Key Program of the National Natural Science Foundation of China (61836013).
Corresponding author: Du Yi     E-mail: duyi@cnic.cn
Cite this article:
Wang Hanxue, Cui Wenjuan, Zhou Yuanchun, Du Yi. Identifying Pathogens of Foodborne Diseases with Machine Learning. Data Analysis and Knowledge Discovery, 2021, 5(9): 54-62.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2020.1105      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I9/54
Fig.1  Workflow of word vector representation integrating domain knowledge
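Field              Example
Dish name          Yuxiang shredded pork (鱼香肉丝)
Ingredient list    pork tenderloin, winter bamboo shoots, carrot, black fungus, minced scallion, minced ginger, minced garlic, chopped chili, starch slurry, cooking wine, vinegar, light soy sauce, sugar, sesame oil, water
Preparation        Soak the black fungus in warm water until it expands, then rinse it. Cut the pork tenderloin into shreds along the grain and first rub it with a pinch of salt until it turns sticky. Mix one teaspoon of dry starch with a little water into a slurry, add it to the shredded pork, and knead by hand until it is fully absorbed…
Category           Yuxiang shredded pork, home-style dishes, dishes that go well with rice
Table 1  Example of recipe data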
Fig.2  Random walk process
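This page does not describe the walk itself, so the following is only a minimal DeepWalk-style sketch of how random walks could turn recipe data like Table 1 into token sequences for word vector training. It assumes an undirected graph that links each dish to its ingredients and categories; the function names, the second recipe, and the walk parameters are all hypothetical and are not taken from the paper.

```python
import random


def build_graph(recipes):
    """Build an undirected graph linking each dish to its ingredients and categories."""
    graph = {}
    for recipe in recipes:
        dish = recipe["dish"]
        for neighbor in recipe["ingredients"] + recipe["categories"]:
            if neighbor == dish:
                continue  # a dish may be listed as its own category; skip self-loops
            graph.setdefault(dish, set()).add(neighbor)
            graph.setdefault(neighbor, set()).add(dish)
    return graph


def random_walk(graph, start, walk_length, rng):
    """Uniform random walk: at each step, move to a randomly chosen neighbor."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = sorted(graph[walk[-1]])  # sorted so a fixed seed is reproducible
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk


def generate_walks(graph, walks_per_node, walk_length, seed=0):
    """Start several walks from every node; each walk acts as one 'sentence'."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = sorted(graph)
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(random_walk(graph, node, walk_length, rng))
    return walks


# Toy data: the first recipe is abridged from Table 1, the second is hypothetical.
recipes = [
    {"dish": "yuxiang shredded pork",
     "ingredients": ["pork tenderloin", "carrot", "black fungus"],
     "categories": ["yuxiang shredded pork", "home-style dish"]},
    {"dish": "stir-fried pork with carrot",
     "ingredients": ["pork tenderloin", "carrot"],
     "categories": ["home-style dish"]},
]

graph = build_graph(recipes)
walks = generate_walks(graph, walks_per_node=2, walk_length=5)
for walk in walks[:3]:
    print(" -> ".join(walk))
```

In a setup like this, the resulting walks would be fed to a word2vec-style model as if they were sentences, so that foods sharing ingredients or categories end up with nearby vectors.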
Parameter          Setting
corpus_size        268 GB
window size        5
min_count          10
size               128
negative_sample    0
iteration          5
Table 2  Parameter settings of the word vector pre-training model
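Fig.3  Distribution in two-dimensional space of exposure-food word vectors for selected categories
Fig.4  Number of OOV words: general-corpus word vectors vs. word vectors trained on a corpus with domain knowledge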
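The OOV (out-of-vocabulary) count compared in Fig.4 is the number of exposure-food terms for which a word vector model has no vector. As a minimal illustration, the sketch below counts OOV terms against two vocabularies. The food terms and both vocabularies are hypothetical toy data, not the paper's, and the function name is an assumption.

```python
def find_oov(terms, vocabulary):
    """Return the terms that have no entry in the given vocabulary."""
    return [term for term in terms if term not in vocabulary]


# Hypothetical toy data, for illustration only.
exposure_foods = ["yuxiang shredded pork", "cold black fungus salad", "rice", "oyster"]
general_vocab = {"rice", "oyster"}
domain_vocab = general_vocab | {"yuxiang shredded pork", "cold black fungus salad"}

oov_general = find_oov(exposure_foods, general_vocab)
oov_domain = find_oov(exposure_foods, domain_vocab)

print("OOV with general vocabulary:", len(oov_general))
print("OOV with domain-augmented vocabulary:", len(oov_domain))
```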
Metric       Value
Precision    68%
Recall       68%
F1-Score     68%
Table 3  Experimental results
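Actual / Predicted          Salmonella    Norovirus    Diarrheagenic Escherichia coli    Vibrio parahaemolyticus
Salmonella                        3710          566                               667                        266
Norovirus                          623         3926                               463                        415
Diarrheagenic Escherichia coli     792          405                              1920                        364
Vibrio parahaemolyticus            210          338                               326                       2444
Table 4  Confusion matrix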
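The headline figures in Table 3 can be cross-checked against the counts in Table 4. The sketch below computes per-class and macro-averaged precision, recall, and F1 from the confusion matrix. It assumes that rows are actual labels and columns are predicted labels, which the flattened table header does not make fully clear; if the axes are the other way round, precision and recall simply swap. The page also does not say which averaging the authors used, so macro-averaging here is an assumption. Under these assumptions the three macro-averaged values all come out near 0.68, consistent with Table 3.

```python
labels = [
    "Salmonella",
    "Norovirus",
    "Diarrheagenic Escherichia coli",
    "Vibrio parahaemolyticus",
]

# Counts copied from Table 4.
# Assumption: matrix[i][j] = cases of actual class i predicted as class j.
matrix = [
    [3710, 566, 667, 266],
    [623, 3926, 463, 415],
    [792, 405, 1920, 364],
    [210, 338, 326, 2444],
]


def per_class_metrics(matrix):
    """Return a (precision, recall, f1) tuple for each class."""
    n = len(matrix)
    row_sums = [sum(row) for row in matrix]                              # actual counts
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]   # predicted counts
    results = []
    for k in range(n):
        true_positive = matrix[k][k]
        precision = true_positive / col_sums[k]
        recall = true_positive / row_sums[k]
        f1 = 2 * precision * recall / (precision + recall)
        results.append((precision, recall, f1))
    return results


metrics = per_class_metrics(matrix)
macro_precision = sum(m[0] for m in metrics) / len(metrics)
macro_recall = sum(m[1] for m in metrics) / len(metrics)
macro_f1 = sum(m[2] for m in metrics) / len(metrics)

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[k][k] for k in range(len(matrix))) / total

for label, (p, r, f) in zip(labels, metrics):
    print(f"{label}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
print(f"macro: precision={macro_precision:.3f} recall={macro_recall:.3f} f1={macro_f1:.3f}")
print(f"accuracy={accuracy:.3f} over {total} cases")
```

The per-class values also show where the model struggles: diarrheagenic Escherichia coli has the lowest precision and recall of the four classes in this matrix.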