New Technology of Library and Information Service  2012, Vol. 28 Issue (4): 29-34    DOI: 10.11925/infotech.1003-3513.2012.04.05
Research on Chinese Word Segmentation Optimization in Statistical Machine Translation
Shi Chongde, Wang Huilin
Institute of Scientific & Technical Information of China, Beijing 100038, China
Abstract  This paper analyzes different segmentation approaches and how they affect word alignment in Statistical Machine Translation (SMT). It then proposes two methods for optimizing Chinese Word Segmentation (CWS), one based on granularity constraint and one on sub-word tagging. Experimental results show that these methods can improve the quality of machine translation.
Key words  Chinese word segmentation; Machine translation; Granularity constraint; Sub-word tagging
Received: 23 February 2012      Published: 20 May 2012
CLC number: TP391.2

Cite this article:

Shi Chongde, Wang Huilin. Research on Chinese Word Segmentation Optimization in Statistical Machine Translation. New Technology of Library and Information Service, 2012, 28(4): 29-34.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2012.04.05     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2012/V28/I4/29
