Research on Chinese Word Segmentation Optimization in Statistical Machine Translation
Shi Chongde, Wang Huilin
Institute of Scientific & Technical Information of China, Beijing 100038, China
Abstract: This paper analyzes how different segmentation approaches affect word alignment in Statistical Machine Translation (SMT). It then proposes two methods for optimizing Chinese Word Segmentation (CWS), one based on a granularity constraint and the other on sub-word tagging. Experimental results show that these methods can improve the quality of machine translation.
Received: 23 February 2012
Published: 20 May 2012