Segmenting Chinese Words from Food Safety Emergencies
Zhang Yue1, Wang Dongbo1,2, Zhu Danhao3
1College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China
2Research Center for Correlation of Domain Knowledge, Nanjing Agricultural University, Nanjing 210095, China
3Library of Jiangsu Police Institute, Nanjing 210031, China
[Objective] This paper examines automatic word segmentation models, which play a key role in building databases for food safety administration. We used a statistical learning method based on conditional random fields to segment words in texts on food safety emergencies. [Methods] First, we analyzed the length of the target words and ran multiple experiments on the selection of word features and the design of feature templates for automatic segmentation. Second, we measured the impact of different features and templates on the segmentation results. [Results] Selecting more features did not necessarily yield better results, because the features interfered with one another. About 46.62% of the phrases in the corpus of food safety emergencies contained only two or three words. Within the feature template, the first words before and after the current word had a greater effect on the results. [Conclusions] We identified the optimal features and template for automatic word segmentation; the F score reached 92.88% with the 5-tag features.
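To make the tagging and evaluation steps concrete, here is a minimal Python sketch of how segmented text can be turned into per-character labels for a sequence model, and how a segmentation F score can be computed. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not define its five-tag set, so the sketch assumes the common scheme S (single-character word), B (first character), B2 (second character), M (later middle characters), and E (last character). The function names and the sample sentence are hypothetical.

```python
from typing import List, Set, Tuple


def words_to_tags(words: List[str]) -> List[Tuple[str, str]]:
    """Label every character with an ASSUMED 5-tag scheme: S, B, B2, M, E."""
    tagged: List[Tuple[str, str]] = []
    for word in words:
        n = len(word)
        if n == 1:
            tags = ["S"]
        elif n == 2:
            tags = ["B", "E"]
        else:
            # First char B, second char B2, remaining inner chars M, last char E.
            tags = ["B", "B2"] + ["M"] * (n - 3) + ["E"]
        tagged.extend(zip(word, tags))
    return tagged


def tags_to_words(tagged: List[Tuple[str, str]]) -> List[str]:
    """Rebuild words from (character, tag) pairs; S and E close a word."""
    words: List[str] = []
    current = ""
    for char, tag in tagged:
        current += char
        if tag in ("S", "E"):
            words.append(current)
            current = ""
    if current:  # keep a trailing fragment if the tag sequence ends mid-word
        words.append(current)
    return words


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Represent a segmentation as a set of (start, end) character offsets."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def f_score(gold: List[str], predicted: List[str]) -> float:
    """Harmonic mean of word-level precision and recall."""
    gold_spans, pred_spans = word_spans(gold), word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example sentence, not taken from the paper's corpus.
gold = ["食品", "安全", "突发事件"]
predicted = ["食品", "安全", "突发", "事件"]
print(words_to_tags(gold))
print(f_score(gold, predicted))
```

In a conditional random field setup, the (character, tag) pairs would serve as training rows, and the feature template would decide which neighboring characters the model may consult when predicting each tag.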