Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (2): 64-72    DOI: 10.11925/infotech.2096-3467.2017.02.09
Original Article
Segmenting Chinese Words from Food Safety Emergencies
Yue Zhang1, Dongbo Wang1,2, Danhao Zhu3
1College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China
2Research Center for Correlation of Domain Knowledge, Nanjing Agricultural University, Nanjing 210095, China
3Library of Jiangsu Police Institute, Nanjing 210031, China
Abstract  

[Objective] This paper examines automatic word segmentation models, which play a key role in building databases for food safety administration. We used a statistical learning method based on conditional random fields (CRF) to segment words in texts on food safety emergencies. [Methods] First, we analyzed the length of target words and ran multiple experiments on feature selection and feature templates for automatic segmentation. Second, we measured how different features and templates affected the segmentation results. [Results] Selecting more features did not necessarily yield better results, because the features interfered with one another. About 46.62% of the phrases in the food safety emergency corpus contained only two or three words. In the feature template, the first words before and after the current word had a greater effect on the results. [Conclusions] We identified the optimal features and template for automatic word segmentation; the F score reached 92.88% with the 5Tag features.
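
The abstract does not define the 5Tag features or how the F score is computed, so the following Python sketch is illustrative only and is not the authors' implementation. It assumes the 5-tag set is B, B2, M, E, S (a convention used in some CRF segmentation work) and that the F score is the usual word-level harmonic mean of precision and recall. The function names to_5tag, word_spans, and prf are hypothetical.

```python
from typing import List, Set, Tuple


def to_5tag(words: List[str]) -> List[Tuple[str, str]]:
    """Label every character with its position in its word.

    Assumed tag set (not confirmed by the abstract):
    S = single-character word, B = first character, B2 = second
    character of a word with three or more characters, M = any later
    middle character, E = last character.
    """
    labeled: List[Tuple[str, str]] = []
    for word in words:
        n = len(word)
        if n == 1:
            tags = ["S"]
        elif n == 2:
            tags = ["B", "E"]
        else:
            tags = ["B", "B2"] + ["M"] * (n - 3) + ["E"]
        labeled.extend(zip(word, tags))
    return labeled


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Turn a segmentation into a set of (start, end) character offsets."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Word-level precision, recall, and F score for one sentence."""
    gold_spans, pred_spans = word_spans(gold), word_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    # Made-up example sentence, not taken from the paper's corpus.
    gold = ["食品", "安全", "突发", "事件"]
    pred = ["食品", "安全", "突发事件"]
    print(to_5tag(pred))
    print(prf(gold, pred))
```

A predicted word counts as correct only when both of its boundaries match the gold segmentation, which is why merging two gold words into one lowers both precision and recall.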

Key words: Chinese Word Segmentation; Food Safety; Conditional Random Field; Feature Template; Feature Selection
Received: 22 September 2016      Published: 27 March 2017

Cite this article:

Yue Zhang,Dongbo Wang,Danhao Zhu. Segmenting Chinese Words from Food Safety Emergencies. Data Analysis and Knowledge Discovery, 2017, 1(2): 64-72.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.02.09     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2017/V1/I2/64
