New Technology of Library and Information Service  2013, Vol. Issue (6): 36-41    DOI: 10.11925/infotech.1003-3513.2013.06.06
Sentence Alignment and Re-Alignment for Environmental Protection Texts in English-Chinese Parallel Corpus
Xiong Wenxin
National Research Centre for Foreign Language Education, Beijing Foreign Studies University, Beijing 100089, China
Abstract: Sentence alignment is a crucial step in building a parallel corpus, and many statistics-based alignment tools are available for constructing language resources for machine translation systems. From a resource-construction perspective, this paper compares existing statistics-based sentence alignment tools in terms of usability and alignment quality, and finds that Champollion is better suited than other mainstream open-source tools to aligning English-Chinese parallel texts. Inspired by the “transformation-based error-driven” strategy, the author analyzes the alignment errors in Champollion’s output and applies linguistic rules to re-align them, which further improves alignment quality. For English-Chinese texts in the environmental protection domain, re-alignment applied as a module on top of Champollion’s output raises sentence alignment precision from a baseline of 88.74% to 93.91%, a gain of 5.17 percentage points. This approach, which combines a statistics-based alignment tool with linguistic knowledge, is generally applicable to building large-scale bilingual corpora “step by step, domain by domain”.
Key words: English-Chinese parallel corpus; Environmental protection text; Sentence alignment; Re-alignment; Transformation-based error-driven
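The two-stage idea in the abstract (a statistical aligner followed by error-driven correction) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the bead representation, the `merge_adjacent` rule, the `error_driven_realign` loop, and the toy data are hypothetical and are not taken from the paper or from Champollion. The sketch only shows how precision could be measured against a gold alignment and how a correction rule might be kept only when it raises that precision.

```python
from typing import Callable, List, Set, Tuple

# One alignment "bead": (indices of English sentences, indices of Chinese sentences).
Bead = Tuple[Tuple[int, ...], Tuple[int, ...]]
Rule = Callable[[List[Bead]], List[Bead]]


def precision(predicted: List[Bead], gold: Set[Bead]) -> float:
    """Share of predicted beads that also occur in the gold alignment."""
    if not predicted:
        return 0.0
    correct = sum(1 for bead in predicted if bead in gold)
    return correct / len(predicted)


def merge_adjacent(i: int) -> Rule:
    """Hypothetical correction rule: merge bead i with bead i + 1."""
    def apply(beads: List[Bead]) -> List[Bead]:
        if i + 1 >= len(beads):
            return beads
        (en_a, zh_a), (en_b, zh_b) = beads[i], beads[i + 1]
        return beads[:i] + [(en_a + en_b, zh_a + zh_b)] + beads[i + 2:]
    return apply


def error_driven_realign(beads: List[Bead], gold: Set[Bead],
                         rules: List[Rule]) -> List[Bead]:
    """Keep applying candidate rules as long as one of them raises precision."""
    current = list(beads)
    best = precision(current, gold)
    improved = True
    while improved:
        improved = False
        for rule in rules:
            trial = rule(current)
            score = precision(trial, gold)
            if score > best:  # accept a rule only if it reduces errors
                current, best, improved = trial, score, True
    return current


# Toy data (invented): the aligner wrongly split one 2-to-1 bead into two beads.
gold = {((0,), (0,)), ((1, 2), (1,)), ((3,), (2,))}
predicted = [((0,), (0,)), ((1,), (1,)), ((2,), ()), ((3,), (2,))]

before = precision(predicted, gold)
realigned = error_driven_realign(predicted, gold, [merge_adjacent(1)])
after = precision(realigned, gold)
```

In this toy case the single merge rule repairs the wrongly split bead, and a second application of the same rule is rejected because it would lower precision. Real correction rules would be driven by linguistic cues rather than by a fixed bead index.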
Received: 2013-04-11
CLC number: TP361
Corresponding author: Xiong Wenxin     E-mail:
Xiong Wenxin. Sentence Alignment and Re-Alignment for Environmental Protection Texts in English-Chinese Parallel Corpus[J]. New Technology of Library and Information Service, 2013, (6): 36-41. DOI: 10.11925/infotech.1003-3513.2013.06.06.
