An Unsupervised Approach to Optimize Chinese Word Segmentation on Domain Literature
Ni Weijian, Sun Haohao, Liu Tong, Zeng Qingtian
College of Computer Science and Technology, Shandong University of Science and Technology, Qingdao 266510, China
Abstract [Objective] This paper aims to improve Chinese word segmentation on domain literature by refining the output of existing segmenters. [Methods] First, based on an analysis of segmentation errors, we propose a new criterion, Term Frequency Deviation (TFD), to capture the word-formation characteristics of domain literature. We then develop an unsupervised approach that uses TFD to refine segmentation results. [Results] We evaluated the proposed approach on agricultural documents. It improved the F1 measure of three popular Chinese word segmentation tools (ICTCLAS, THULAC, and LTP) by 2% to 3%. The approach is easy to use and robust to parameter settings. [Limitations] The recall of the proposed approach needs to be improved. [Conclusions] The proposed approach improves the performance of existing segmenters on domain literature and, because it does not depend on domain-specific vocabularies or annotated corpora, could be applied to other fields.
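For readers unfamiliar with how the F1 measure is usually computed for word segmentation, the following is a minimal sketch, not the authors' evaluation code. It assumes the common convention that a predicted word counts as correct only when its character span exactly matches a gold-standard word. The function names and the example sentence are hypothetical illustrations and do not come from the paper.

```python
def word_spans(words):
    """Map a segmented sentence to a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold_words, pred_words):
    """Return (precision, recall, F1) for one segmented sentence.

    Assumption: a predicted word is correct only if both of its
    boundaries agree with a word in the gold segmentation.
    """
    gold = word_spans(gold_words)
    pred = word_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: a general-purpose segmenter splits a domain term.
gold = ["小麦", "赤霉病", "防治"]
pred = ["小麦", "赤", "霉病", "防治"]
precision, recall, f1 = segmentation_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this illustration the domain term is wrongly split into two pieces, so only two of the four predicted words match the gold standard. This is the kind of error a refinement step applied after an existing segmenter would aim to correct.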
Received: 28 September 2017
Published: 07 March 2018