Please wait a minute...
Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (2): 96-104    DOI: 10.11925/infotech.2096-3467.2017.0990
Current Issue | Archive | Adv Search |
An Unsupervised Approach to Optimize Chinese Word Segmentation on Domain Literature
Ni Weijian, Sun Haohao, Liu Tong(), Zeng Qingtian
College of Computer Science and Technology, Shandong University of Science and Technology, Qingdao 266510, China
Download: PDF (1111 KB)   HTML ( 5
Export: BibTeX | EndNote (RIS)      
Abstract  

[Objective] This paper aims to improve the performance of Chinese word segmentation techniques on domain literature by optimizing results of existing approaches. [Methods] First, we proposed a new criteria of Term Frequency Deviation (TFD) to capture word formation characteristics of domain literature based on the analysis of segmentation errors. Then, we developed an unsupervised segmentation refining approach with the help of TFD. [Results] We examined the proposed approach with agriculture documents. It improved the segmentation results of three popular Chinese word segmentation approaches (i.e., ICTCLAS, THULAC and LTP) by 2%~3% in F1 measure. The proposed approach was easy to use and robustness to parameters. [Limitations] The recall of the proposed approach needs to be improved. [Conclusions] The new Chinese word segmentation approach, which imrpoves the performance of traditional methods on domain literature, could be applied to other fields due to its independence of domain-specific vocabulary and annotated corpus.

Key wordsDomain Literature      Chinese Word Segmentation      Segmentation Refining      Term Frequency Deviation     
Received: 28 September 2017      Published: 07 March 2018
ZTFLH:  TP391 G35  

Cite this article:

Ni Weijian,Sun Haohao,Liu Tong,Zeng Qingtian. An Unsupervised Approach to Optimize Chinese Word Segmentation on Domain Literature. Data Analysis and Knowledge Discovery, 2018, 2(2): 96-104.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.0990     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2018/V2/I2/96

文档数 句子数 字数 词数
ICTCLAS 10 697 852 051 6 351 134 4 275 300
THULAC 4 078 779
LTP 4 008 380
人工标注 500 40 173 298 434 223 011
Approach ICTCLAS THULAC LTP
Precision Recall F1 Precision Recall F1 Precision Recall F1
BASEICTCLAS 74.60% 97.72% 84.61% 76.62% 94.08% 84.46% 77.84% 92.57% 84.57%
TFD 83.49% 89.18% 86.24% 82.09% 90.66% 86.16% 82.52% 90.28% 86.22%
rTFD 82.98% 92.12% 87.31% 83.29% 90.27% 86.65% 83.66% 89.71% 86.58%
MI 83.31% 84.98% 84.13% 83.26% 83.06% 83.16% 83.12% 85.20% 84.14%
rMI 84.47% 87.70% 86.05% 79.03% 91.85% 84.96% 81.08% 89.98% 85.30%
[1] 黄昌宁, 赵海. 中文分词十年回顾[J]. 中文信息学报, 2007, 21(3): 8-19.
doi: 10.3969/j.issn.1003-0077.2007.03.002
[1] (Huang Changning, Zhao Hai.Chinese Word Segmentation: A Decade Review[J]. Journal of Chinese Information Processing, 2007, 21(3): 8-19.)
doi: 10.3969/j.issn.1003-0077.2007.03.002
[2] ICTCLAS 2016 [EB/OL]. [2016-05-31]. .
[3] THULAC[EB/OL]. [2016-03-27]. .
[4] 语言技术平台云 [EB/OL]. [2015-10-31]. .
[4] (LTP-Cloud [EB/OL]. [2015-10-31].
[5] 张桂平, 刘东生, 尹宝生, 等. 面向专利文献的中文分词技术的研究[J]. 中文信息学报, 2010, 24(3): 112-117.
doi: 10.3969/j.issn.1003-0077.2010.03.017
[5] (Zhang Guiping, Liu Dongsheng, Yin Baosheng, et al.Research on Chinese Word Segmentation for Patent Documents[J]. Journal of Chinese Information Processing, 2010, 24(3): 112-117.)
doi: 10.3969/j.issn.1003-0077.2010.03.017
[6] 岳金媛, 徐金安, 张玉洁. 面向专利文献的汉语分词技术研究[J]. 北京大学学报, 2013, 49(1): 159-164.
[6] (Yue Jinyuan, Xu Jin’an, Zhang Yujie.Chinese Word Segmentation for Patent Documents[J]. Acta Scientiarum Naturalium Universitatis Pekinensis, 2013, 49(1): 159-164.)
[7] 张杰, 张海超, 翟东升. 面向中文专利权利要求书的分词方法研究[J]. 现代图书情报技术, 2014(9): 91-98.
[7] (Zhang Jie, Zhang Haichao, Zhai Dongsheng.Research of the Word Segmentation for Chinese Patent Claims[J]. New Technology of Library and Information Service, 2014(9): 91-98.)
[8] Li S, Xue N.Effective Document-Level Features for Chinese Patent Word Segmentation[C]//Proceedings of the 52nd Annual Meeting of the ACL. 2014:199-205.
[9] 王军辉, 胡铁军, 李丹亚, 等. 中文生物医学文本无词典分词方法研究[J]. 情报学报, 2011, 30(2): 197-203.
doi: 10.3772/j.issn.1000-0135.2011.02.012
[9] (Wang Junhui, Hu Tiejun, Li Danya, et al.Research on Method for Chinese Word Segmentation Without Thesaurus in Chinese Biomedical Text[J]. Journal of the China Society for Scientific and Technical Information, 2011, 30(2): 197-203.)
doi: 10.3772/j.issn.1000-0135.2011.02.012
[10] 李国垒, 陈先来, 夏冬, 等. 中文病历文本分词方法研究[J]. 中国生物医学工程学报, 2016, 35(4): 477-481.
[10] (Li Guolei, Chen Xianlai, Xia Dong, et al.Research on Segmentation of Chinese Text in Medical Record[J]. Chinese Journal of Biomedical Engineering, 2016, 35(4): 477-481.)
[11] 王晓玉, 李斌. 基于CRFs和词典信息的中古汉语自动分词[J]. 数据分析与知识发现, 2017, 1(5): 62-70.
[11] (Wang Xiaoyu, Li Bin.Automatically Segmenting Middle Ancient Chinese Words with CRFs[J]. Data Analysis and Knowledge Discovery, 2017, 1(5): 62-70.)
[12] 黄水清, 王东波, 何琳. 以《汉学引得丛刊》为领域词表的先秦典籍自动分词探讨[J]. 图书情报工作, 2015, 59(11): 127-133.
doi: 10.13266/j.issn.0252-3116.2015.11.018
[12] (Huang Shuiqing, Wang Dongbo, He Lin.Exploring of Word Segmentation for For-Qin Literature Based on the Domain Glossary of Sinological Index Series[J]. Library and Information Service, 2015, 59(11): 127-133.)
doi: 10.13266/j.issn.0252-3116.2015.11.018
[13] 张越, 王东波, 朱丹浩. 面向食品安全突发事件汉语分词的特征选择及模型优化研究[J]. 数据分析与知识发现, 2017, 1(2): 64-72.
[13] (Zhang Yue, Wang Dongbo, Zhu Danhao.Segmenting Chinese Words from Food Safety Emergencies[J]. Data Analysis and Knowledge Discovery, 2017, 1(2): 64-72.)
[14] 张琳, 秦策, 叶文豪. 基于条件随机场的法言法语实体自动识别模型研究[J]. 数据分析与知识发现, 2017, 1(11): 46-52
[14] (Zhang Lin, Qin Ce, Ye Wenhao.Automatic Recognition of Legal Language Entities Based on Conditional Random Fields[J]. Data Analysis and Knowledge Discovery, 2017, 1(11): 46-52.)
[15] 石崇德, 王惠临. 统计机器翻译中文分词优化技术研究[J]. 现代图书情报技术, 2012(4): 29-34.
[15] (Shi Chongde, Wang Huilin.Research on Chinese Word Segmentation Optimization in Statistical Machine Translation[J]. New Technology of Library and Information Service, 2012(4): 29-34.)
[16] 韩冬煦, 常宝宝. 中文分词模型的领域适应性方法[J]. 计算机学报, 2015, 38(2): 272-281.
doi: 10.3724/SP.J.1016.2015.00272
[16] (Han Dongxu, Chang Baobao.Approches to Domain Adaptive Chinese Segmetation Model[J]. Chinese Journal of Computers, 2015, 38(2): 272-281.)
doi: 10.3724/SP.J.1016.2015.00272
[17] Zeng D, Wei D, Chau M, et al.Domain-specific Chinese Word Segmentation Using Suffix Tree and Mutual Information[J]. Information Systems Frontiers, 2011, 13(1): 115-125.
doi: 10.1007/s10796-010-9278-5
[18] Song Y, Xia F.Using a Goodness Measurement for Domain Adaptation: A Case Study on Chinese Word Segmentation[C]//Proceedings of the 6th Language Resources and Evaluation Conference. 2012: 3853-3860.
[19] 谷俊, 王昊. 基于领域中文文本的术语抽取方法研究[J]. 现代图书情报技术, 2011(4): 29-34.
[19] (Gu Jun, Wang Hao.Study on Term Extraction on the Basis of Chinese Domain Texts[J]. New Technology of Library and Information Service, 2011(4): 29-34.)
[20] 许华婷, 张玉洁, 杨晓晖, 等. 基于Active Learning的中文分词领域自适应[J]. 中文信息学报, 2015, 29(5): 55-63.
[20] (Xu Huating, Zhang yujie, Yang Xiaohui, et al. Active Learning Based Domain Adaptation for Chinese Word Segmentation[J]. Journal of Chinese Information Processing, 2015, 29(5): 55-63.)
[21] Liu Y, Zhang Y, Che W, et al.Domain Adaptation for CRF-based Chinese Word Segmentation Using Free Annotations[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 2014: 864-874.
[22] 张梅山, 邓知龙, 车万翔, 等. 统计与词典相结合的领域自适应中文分词[J]. 中文信息学报, 2012, 26(2): 8-13.
doi: 10.3969/j.issn.1003-0077.2012.02.002
[22] (Zhang Meishan, Deng Zhilong, Che Wanxiang, et al.Combing Statistical Model and Dictionary for Domain Adaption of Chinese Word Segmentation[J]. Journal of Chinese Information Processing, 2012, 26(2): 8-13.)
doi: 10.3969/j.issn.1003-0077.2012.02.002
[23] Beeferman D, Berger A, Lafferty J.Statistical Models for Text Segmentation[J]. Machine Learning, 1999, 34(1-3): 177-210.
doi: 10.1023/A:1007506220214
[24] 俞士汶, 段慧明, 朱学锋, 等. 北京大学现代汉语语料库基本加工规范(续)[J]. 中文信息学报, 2002, 16(6): 58-65.
[24] (Yu Shiwen, Duan Huiming, Zhu Xuefeng, et al.The Basic Processing of Contemporary Chinese Corpus at Peking University Specification[J]. Journal of Chinese Information Processing, 2002, 16(6): 58-65.)
[1] Feng Guoming,Zhang Xiaodong,Liu Suhui. DBLC Model for Word Segmentation Based on Autonomous Learning[J]. 数据分析与知识发现, 2018, 2(5): 40-47.
[2] Zhang Yue,Wang Dongbo,Zhu Danhao. Segmenting Chinese Words from Food Safety Emergencies[J]. 数据分析与知识发现, 2017, 1(2): 64-72.
[3] Yu Xincong, Li Honglian, Lv Xueqiang. Research on the Application of Hyponymy in the Enrollment Robot[J]. 现代图书情报技术, 2015, 31(12): 65-71.
[4] Zhang Jie, Zhang Haichao, Zhai Dongsheng. Research of the Word Segmentation for Chinese Patent Claims[J]. 现代图书情报技术, 2014, 30(9): 91-98.
[5] Li Wenjiang, Chen Shiqin. Application of AIMLBot Intelligent Robot in Real-time Virtual Reference Service[J]. 现代图书情报技术, 2012, 28(7): 127-132.
[6] Jiang Hua, Su Xiaoguang. Chinese High-frequency Words Extraction Algorithm Without Thesaurus[J]. 现代图书情报技术, 2012, 28(6): 50-53.
[7] Shi Chongde, Wang Huilin. Research on Chinese Word Segmentation Optimization in Statistical Machine Translation[J]. 现代图书情报技术, 2012, 28(4): 29-34.
[8] Gu Jun, Wang Hao. Study on Term Extraction on the Basis of Chinese Domain Texts[J]. 现代图书情报技术, 2011, 27(4): 29-34.
[9] Xie Hui,Qin Jie,Hu Shuangshuang. The Study on the Duplicated Web Pages Detection Algorithm Based on the Keyword from User’s Submission[J]. 现代图书情报技术, 2008, 24(7): 43-46.
[10] Zhang Jinzhu,Zhang Dong,Wang Huilin. The Research of Character-Position-Based Chinese Word Segmentation[J]. 现代图书情报技术, 2008, 24(5): 39-43.
[11] Yao Xingshan. The Improvement in a Chinese Word Segmentation Based on Hash Algorism[J]. 现代图书情报技术, 2008, 24(3): 78-81.
[12] Wu Shaogen . Study of Scheme Automaton for Chinese Word Automatic Segmentation[J]. 现代图书情报技术, 2006, 1(5): 47-49.
  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938   E-mail:jishu@mail.las.ac.cn