Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (2): 64-72    DOI: 10.11925/infotech.2096-3467.2017.02.09
Original Article
Segmenting Chinese Words from Food Safety Emergencies
Zhang Yue1, Wang Dongbo1,2, Zhu Danhao3
1College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China
2Research Center for Correlation of Domain Knowledge, Nanjing Agricultural University, Nanjing 210095, China
3Library of Jiangsu Police Institute, Nanjing 210031, China
Abstract  

[Objective] This paper examines automatic word segmentation models, which play a key role in building databases for food safety administration. We used a statistical learning method based on conditional random fields (CRF) to segment words in texts on food safety emergencies. [Methods] First, we analyzed the length distribution of the target words and ran multiple experiments on feature selection and feature templates for automatic segmentation. Second, we measured the impact of different features and templates on the segmentation results. [Results] Selecting more features does not necessarily yield better results, because the features interfere with one another. About 46.62% of the words in the food safety emergency corpus consist of two or three characters. Within the feature template, the characters immediately before and after the current character affect the results most. [Conclusions] We identified the optimal features and template for automatic word segmentation; the F-score reaches 92.88% with the 5Tag features.

Key words: Chinese Word Segmentation; Food Safety; Conditional Random Field; Feature Template; Feature Selection
Received: 22 September 2016      Published: 27 March 2017
CLC number (ZTFLH): G351

Cite this article:

Zhang Yue,Wang Dongbo,Zhu Danhao. Segmenting Chinese Words from Food Safety Emergencies. Data Analysis and Knowledge Discovery, 2017, 1(2): 64-72.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.02.09     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2017/V1/I2/64

Text   Correct tag   CRF output tag   Text   Correct tag   CRF output tag
       S             S                       S             S
       B             S                       B             B
       E             S                       I             E
       S             S                       M             B
       B             B                       E             E
       E             E                       S             S

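The two error cases above are easier to read as segmentations once the tag sequences are decoded into word lengths. The following is a minimal sketch, assuming a word closes at every E or S tag; `tags_to_lengths` is a hypothetical helper introduced here for illustration and is not part of the paper's toolchain.

```python
def tags_to_lengths(tags):
    """Decode a tag sequence into word lengths; a word closes at E or S."""
    lengths, current = [], 0
    for tag in tags:
        current += 1
        if tag in ("E", "S"):
            lengths.append(current)
            current = 0
    if current:  # keep an unterminated word at the end of the sequence
        lengths.append(current)
    return lengths


# Left-hand example from the table: a two-character word is split into two singles.
gold_left = ["S", "B", "E", "S", "B", "E"]
crf_left = ["S", "S", "S", "S", "B", "E"]

# Right-hand example: a four-character word (B I M E) is split into two
# two-character words (B E, B E).
gold_right = ["S", "B", "I", "M", "E", "S"]
crf_right = ["S", "B", "E", "B", "E", "S"]

print(tags_to_lengths(gold_left), tags_to_lengths(crf_left))
print(tags_to_lengths(gold_right), tags_to_lengths(crf_right))
```

Decoding by word length rather than by character keeps the sketch independent of the text column, which is not available in the table.
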
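Tag set   Tags                 Description
4Tag      {B, M, E, S}         B marks the first character of a word, M a middle character, E the last character, and S a single-character word.
5Tag      {B, I, M, E, S}      B marks the first character of a word, I the first character after the initial one in words of four or more characters, M a middle character, E the last character, and S a single-character word.
6Tag      {B, I, J, M, E, S}   B marks the first character of a word, I the first character after the initial one in words of four or more characters, J the second character after the initial one in words of five or more characters, M a middle character, E the last character, and S a single-character word.
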
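The three tag sets can be sketched as a small encoder that maps a pre-segmented word to one tag per character. This is an illustrative reading of the definitions above, not the authors' code; `tag_word` and `tag_sentence` are hypothetical names, and the sample words were chosen for this sketch rather than taken from the corpus.

```python
def tag_word(word, scheme="4Tag"):
    """Return one position tag per character of `word`."""
    if scheme not in ("4Tag", "5Tag", "6Tag"):
        raise ValueError(f"unknown scheme: {scheme}")
    n = len(word)
    if n == 0:
        return []
    if n == 1:
        return ["S"]
    tags = ["B"] + ["M"] * (n - 2) + ["E"]
    if scheme in ("5Tag", "6Tag") and n >= 4:
        tags[1] = "I"  # first character after the word-initial one
    if scheme == "6Tag" and n >= 5:
        tags[2] = "J"  # second character after the word-initial one
    return tags


def tag_sentence(words, scheme="4Tag"):
    """Tag a pre-segmented sentence given as a list of words."""
    return [tag for word in words for tag in tag_word(word, scheme)]


# Illustrative words for this sketch; they are not taken from the paper's corpus.
words = ["食品安全", "添加剂", "或"]
for scheme in ("4Tag", "5Tag", "6Tag"):
    print(scheme, tag_sentence(words, scheme))
```

Under this reading, 5Tag and 6Tag differ from 4Tag only on words of four or more characters, which is consistent with the small share of I and J tags in the corpus.
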
4Tag
Tag   Count       Percentage
B     597 343.7   30.22%
M     158 744.3   8.03%
E     597 343.7   30.21%
S     623 538.5   31.54%

5Tag
Tag   Count       Percentage
B     597 343.7   30.22%
I     28 529      1.44%
M     130 215.3   6.59%
E     597 343.7   30.21%
S     623 538.5   31.54%

6Tag
Tag   Count       Percentage
B     597 343.7   30.22%
I     28 529      1.44%
J     11 595.4    0.59%
M     118 619.9   6.00%
E     597 343.7   30.21%
S     623 538.5   31.54%

Character   Pinyin   Word length   Position tag
            mei      2             B
            ti       2             E
            diao     2             B
            cha      2             E
            jie      2             B
            tou      2             E
            liang    3             B
            ban      3             M
            cai      3             E
            yuan     2             B
            liao     2             E
            bu       2             B
            fen      2             E
            wei      1             S
            ren      2             B
            zao      2             E
            huo      1             S
            han      1             S
            tian     3             B
            jia      3             M
            ji       3             E

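The rows above can be regenerated from a segmented sentence. The sketch below is a minimal illustration, assuming each word is given as a list of pinyin syllables (the character column is not available here, so syllables stand in for characters); `feature_rows` is a hypothetical helper. The tab-separated layout with the tag in the last column follows the usual CRF++ training-file convention, where a real file would also carry the character itself as the first column.

```python
def feature_rows(words):
    """Yield (syllable, word_length, position_tag) for every unit of every word."""
    for word in words:
        n = len(word)
        for i, syllable in enumerate(word):
            if n == 1:
                tag = "S"
            elif i == 0:
                tag = "B"
            elif i == n - 1:
                tag = "E"
            else:
                tag = "M"
            yield syllable, n, tag


# The first seven words of the sample, written as pinyin syllables.
words = [
    ["mei", "ti"], ["diao", "cha"], ["jie", "tou"],
    ["liang", "ban", "cai"], ["yuan", "liao"], ["bu", "fen"], ["wei"],
]
rows = list(feature_rows(words))

# One token per line, columns separated by tabs, tag last.
training_lines = ["\t".join(str(col) for col in row) for row in rows]
print("\n".join(training_lines))
```
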
Features                       P        R        F
4Tag                           92.85%   92.89%   92.87%
4Tag + word length             92.74%   92.78%   92.76%
4Tag + pinyin                  92.53%   92.57%   92.55%
4Tag + word length + pinyin    92.67%   92.69%   92.68%
5Tag                           92.85%   92.90%   92.88%
5Tag + word length             92.64%   92.69%   92.67%
5Tag + pinyin                  92.32%   92.38%   92.35%
5Tag + word length + pinyin    92.02%   92.08%   92.05%
6Tag                           92.20%   92.11%   92.16%
6Tag + word length             92.09%   92.00%   92.04%
6Tag + pinyin                  92.00%   91.90%   91.95%
6Tag + word length + pinyin    91.71%   91.60%   91.65%

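For readers who want to reproduce figures of this kind, the sketch below computes precision, recall, and F-score at the word level. It assumes the standard definition, in which a predicted word counts as correct only when both of its boundaries match the gold segmentation; the source does not spell out its evaluation script, so `to_spans`, `prf`, and `f_score` are hypothetical helpers.

```python
def f_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def to_spans(words):
    """Convert a word list into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def prf(gold_words, pred_words):
    """Word-level precision, recall, and F-score for one segmentation."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    return p, r, f_score(p, r)


# A four-character word wrongly split in two (placeholder letters, not corpus text).
gold = ["a", "bcde", "f"]
pred = ["a", "bc", "de", "f"]
print(prf(gold, pred))

# Recomputing F from the rounded P and R of the 4Tag row.
print(round(f_score(92.85, 92.89), 2))
```

Because the table reports P and R rounded to two decimals, recomputing F from them can differ from the printed F in the last digit for some rows.
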
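Feature   Feature template         Description
C-2       U01:%x[-2,0]             Second character before the current character
C-1       U02:%x[-1,0]             First character before the current character
C0        U03:%x[0,0]              Current character
C1        U04:%x[1,0]              First character after the current character
C2        U05:%x[2,0]              Second character after the current character
C-1C0     U06:%x[-1,0]/%x[0,0]     Transition probability from the previous character to the current character
C0C1      U07:%x[0,0]/%x[1,0]      Transition probability from the current character to the next character
C-1C1     U08:%x[-1,0]/%x[1,0]     Transition probability from the previous character to the next character
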
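To make the template notation concrete, the sketch below expands a template line at one position of a token sequence: `%x[row,col]` selects the token `row` positions away from the current one and reads column `col`. This is a simplified stand-in written for illustration, not CRF++ itself; `expand` is a hypothetical function, and the `_B-1` / `_B+1` boundary placeholders are an assumption about how out-of-range positions are filled.

```python
import re

# Matches %x[row,col]; optional spaces inside the brackets are tolerated.
MACRO = re.compile(r"%x\[\s*(-?\d+)\s*,\s*(\d+)\s*\]")


def expand(template, rows, pos):
    """Expand one template line at position `pos` of `rows` (a list of column lists)."""
    def lookup(match):
        offset, col = int(match.group(1)), int(match.group(2))
        i = pos + offset
        if i < 0:
            return f"_B{i}"  # assumed placeholder before the sentence start
        if i >= len(rows):
            return f"_B+{i - len(rows) + 1}"  # assumed placeholder after the end
        return rows[i][col]
    return MACRO.sub(lookup, template)


# Pinyin syllables stand in for characters; column 0 is the token itself.
rows = [["mei"], ["ti"], ["diao"], ["cha"]]

templates = [
    "U01:%x[-2,0]", "U02:%x[-1,0]", "U03:%x[0,0]", "U04:%x[1,0]", "U05:%x[2,0]",
    "U06:%x[-1,0]/%x[0,0]", "U07:%x[0,0]/%x[1,0]", "U08:%x[-1,0]/%x[1,0]",
]
for template in templates:
    print(expand(template, rows, 1))
```

Each expanded string becomes one feature for the current position, so the eight templates yield eight features per character.
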
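Feature template (compared with Table 8)        F
Original feature template                       92.88%
Remove unigram features C-2, C2, C-1, C1        92.72%
Remove bigram features C-1C0, C0C1, C-1C1       86.33%
Add unigram features C-3, C3                    92.73%
Add bigram features C1C2, C-1C-2                92.56%
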
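Word type                Count       Percentage
One-character words      1 039 205   51.10%
Two-character words      841 690     41.39%
Three-character words    106 307     5.23%
Four-character words     28 220      1.39%
Five-character words     8 893       0.44%
Six-character words      2 626       0.13%
Other                    6 598       0.32%
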
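The percentages in this table, and the 46.62% figure for two- and three-character words quoted in the abstract, can be checked directly from the counts. The sketch below treats the middle column as word counts; the variable names are illustrative.

```python
# Word counts by length, copied from the table above.
counts = {
    1: 1039205,
    2: 841690,
    3: 106307,
    4: 28220,
    5: 8893,
    6: 2626,
    "other": 6598,
}

total = sum(counts.values())

# Share of each word length, as a percentage rounded to two decimals.
shares = {length: round(100 * n / total, 2) for length, n in counts.items()}

# Combined share of two- and three-character words.
two_or_three = round(100 * (counts[2] + counts[3]) / total, 2)

print(total)
print(shares)
print(two_or_three)
```
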