New Technology of Library and Information Service (现代图书情报技术), 2012, Vol. 28, Issue 4: 29-34    DOI: 10.11925/infotech.1003-3513.2012.04.05
Section: Knowledge Organization and Knowledge Management
Research on Chinese Word Segmentation Optimization in Statistical Machine Translation (统计机器翻译中文分词优化技术研究)
Shi Chongde (石崇德), Wang Huilin (王惠临)
Institute of Scientific & Technical Information of China, Beijing 100038, China
Abstract: This paper examines the factors through which Chinese word segmentation (CWS) affects statistical machine translation (SMT) and analyzes how different segmentation approaches influence the word alignment model. It then proposes two CWS optimization methods, one based on granularity constraints and one based on sub-word tagging. Experimental results show that these methods improve machine translation quality.
Key words: Chinese word segmentation; Machine translation; Granularity constraint; Sub-word tagging
Received: 2012-02-23
CLC number: TP391.2
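Funding: This work is one of the research outcomes of three projects of the Institute of Scientific & Technical Information of China: the key work project "Research and Application Demonstration of Key Technologies for Multilingual Information Acquisition" (No. ZD2011-3-3), the discipline construction project "Natural Language Processing" (No. XK2011-6), and the pre-research fund project "Research on the Theory and Key Algorithms of Example-Based Machine Translation" (No. YY-201126).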

Cite this article:
Shi Chongde, Wang Huilin. Research on Chinese Word Segmentation Optimization in Statistical Machine Translation[J]. New Technology of Library and Information Service, 2012, 28(4): 29-34. DOI: 10.11925/infotech.1003-3513.2012.04.05.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2012.04.05