[Objective] This paper aims to improve the performance of Chinese word segmentation techniques on domain literature by optimizing results of existing approaches. [Methods] First, we proposed a new criteria of Term Frequency Deviation (TFD) to capture word formation characteristics of domain literature based on the analysis of segmentation errors. Then, we developed an unsupervised segmentation refining approach with the help of TFD. [Results] We examined the proposed approach with agriculture documents. It improved the segmentation results of three popular Chinese word segmentation approaches (i.e., ICTCLAS, THULAC and LTP) by 2%~3% in F1 measure. The proposed approach was easy to use and robustness to parameters. [Limitations] The recall of the proposed approach needs to be improved. [Conclusions] The new Chinese word segmentation approach, which imrpoves the performance of traditional methods on domain literature, could be applied to other fields due to its independence of domain-specific vocabulary and annotated corpus.
倪维健, 孙浩浩, 刘彤, 曾庆田. 面向领域文献的无监督中文分词自动优化方法*[J]. 数据分析与知识发现, 2018, 2(2): 96-104.
Ni Weijian,Sun Haohao,Liu Tong,Zeng Qingtian. An Unsupervised Approach to Optimize Chinese Word Segmentation on Domain Literature. Data Analysis and Knowledge Discovery, 2018, 2(2): 96-104.
(Zhang Guiping, Liu Dongsheng, Yin Baosheng, et al.Research on Chinese Word Segmentation for Patent Documents[J]. Journal of Chinese Information Processing, 2010, 24(3): 112-117.)
(Wang Junhui, Hu Tiejun, Li Danya, et al.Research on Method for Chinese Word Segmentation Without Thesaurus in Chinese Biomedical Text[J]. Journal of the China Society for Scientific and Technical Information, 2011, 30(2): 197-203.)
(Huang Shuiqing, Wang Dongbo, He Lin.Exploring of Word Segmentation for For-Qin Literature Based on the Domain Glossary of Sinological Index Series[J]. Library and Information Service, 2015, 59(11): 127-133.)
(Xu Huating, Zhang yujie, Yang Xiaohui, et al. Active Learning Based Domain Adaptation for Chinese Word Segmentation[J]. Journal of Chinese Information Processing, 2015, 29(5): 55-63.)
Liu Y, Zhang Y, Che W, et al.Domain Adaptation for CRF-based Chinese Word Segmentation Using Free Annotations[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 2014: 864-874.
(Zhang Meishan, Deng Zhilong, Che Wanxiang, et al.Combing Statistical Model and Dictionary for Domain Adaption of Chinese Word Segmentation[J]. Journal of Chinese Information Processing, 2012, 26(2): 8-13.)
Beeferman D, Berger A, Lafferty J.Statistical Models for Text Segmentation[J]. Machine Learning, 1999, 34(1-3): 177-210.