New Technology of Library and Information Service, 2013, Issue 6: 36-41    DOI: 10.11925/infotech.1003-3513.2013.06.06
Knowledge Organization and Knowledge Management
Sentence Alignment and Re-Alignment for Environmental Protection Texts in English-Chinese Parallel Corpus
Xiong Wenxin
National Research Centre for Foreign Language Education, Beijing Foreign Studies University, Beijing 100089, China
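Abstract (translated from the Chinese): From the perspective of resource construction, this paper compares existing statistics-based sentence alignment tools in terms of usability and performance, and concludes that Champollion is well suited to English-Chinese sentence alignment. Drawing on the "transformation-based error-driven" approach, linguistic rules are applied to re-align Champollion's erroneous output, which further improves alignment quality. For specialized English-Chinese texts in the environmental protection domain, sentence alignment precision rises from an initial 88.74% to 93.91%. This alignment and re-alignment method, which combines a statistics-based alignment tool with linguistic knowledge, is generally applicable to building large-scale bilingual corpora "step by step and domain by domain".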
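Keywords (translated from the Chinese): English-Chinese parallel corpus; environmental protection texts; sentence alignment; re-alignment; transformation-based error-driven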
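The figures reported in the abstract (88.74% before re-alignment, 93.91% after) can also be read as a reduction in error rate. The Python sketch below assumes the usual definition of precision (correct alignments divided by proposed alignments), which the abstract does not spell out; the function name `relative_error_reduction` is illustrative and does not come from the paper.

```python
def relative_error_reduction(baseline_precision: float, improved_precision: float) -> float:
    """Fraction of baseline errors removed, given two precision values in [0, 1]."""
    baseline_error = 1.0 - baseline_precision
    improved_error = 1.0 - improved_precision
    return (baseline_error - improved_error) / baseline_error


baseline = 0.8874   # precision of the statistical aligner alone (from the abstract)
realigned = 0.9391  # precision after rule-based re-alignment (from the abstract)

gain_points = (realigned - baseline) * 100                  # absolute gain in percentage points
reduction = relative_error_reduction(baseline, realigned)   # share of baseline errors removed

print(f"absolute gain: {gain_points:.2f} points")           # absolute gain: 5.17 points
print(f"relative error reduction: {reduction:.1%}")         # relative error reduction: 45.9%
```

Read this way, the 5.17-point gain corresponds to removing a little under half of the baseline aligner's errors.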
Abstract: Sentence alignment is a crucial step in building a parallel corpus, and many alignment tools are available for constructing language resources for machine translation systems. In an evaluation of usability and alignment quality, Champollion outperforms other mainstream open-source tools in aligning English-Chinese parallel texts. Inspired by the "transformation-based error-driven" strategy, the author carries out a linguistic analysis of Champollion's erroneous output and proposes an error-correction strategy that further improves precision. Attached as a module to Champollion's output, the re-alignment approach raises precision from a baseline of 88.74% to 93.91% for English-Chinese texts in the environmental protection domain. This alignment and re-alignment strategy, which combines a statistics-based method with linguistic insights, can be applied to other domains.
Key words: English-Chinese parallel corpus; Environmental protection text; Sentence alignment; Re-alignment; Transformation-based error-driven
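The two-stage idea in the abstract (run a statistical aligner, then repair its errors with rules) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the abstract does not list the paper's linguistic rules, so the single rule below (fold a one-sided alignment unit into the next unit when that brings the character-length ratio closer to an expected value) is hypothetical, as are the `Bead` type, the `EXPECTED_RATIO` constant, and the function names. Champollion's actual output format is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Bead:
    """One alignment unit: zero or more English sentences paired with zero or more Chinese ones."""
    en: list[str]
    zh: list[str]


# Assumed English-to-Chinese character-length ratio; a real system would estimate it from data.
EXPECTED_RATIO = 3.0


def ratio_gap(bead: Bead) -> float:
    """Distance between the bead's length ratio and the assumed expected ratio."""
    en_len = sum(len(s) for s in bead.en)
    zh_len = sum(len(s) for s in bead.zh)
    if en_len == 0 or zh_len == 0:
        return float("inf")  # a one-sided bead has no meaningful ratio
    return abs(en_len / zh_len - EXPECTED_RATIO)


def realign(beads: list[Bead]) -> list[Bead]:
    """Hypothetical repair rule: merge a one-sided bead into the following bead
    when the merged bead fits the expected length ratio better."""
    out: list[Bead] = []
    i = 0
    while i < len(beads):
        bead = beads[i]
        one_sided = not bead.en or not bead.zh
        if one_sided and i + 1 < len(beads):
            nxt = beads[i + 1]
            merged = Bead(bead.en + nxt.en, bead.zh + nxt.zh)
            if ratio_gap(merged) < ratio_gap(nxt):
                out.append(merged)
                i += 2
                continue
        out.append(bead)
        i += 1
    return out


# Toy first-pass output: the aligner left one English sentence unmatched.
beads = [
    Bead(["Air quality has improved."], []),
    Bead(["Water quality has not."], ["空气质量有所改善,但水质没有。"]),
]
result = realign(beads)
print(len(result), len(result[0].en), len(result[0].zh))
```

In the toy input, the unmatched English sentence is folded into the next unit, giving one 2-to-1 unit. A transformation-based error-driven setup would derive and order such rules from observed aligner errors rather than fix them by hand as done here.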
Received: 2013-04-11
CLC number: TP361
Corresponding author: Xiong Wenxin
Cite this article: Xiong Wenxin. Sentence Alignment and Re-Alignment for Environmental Protection Texts in English-Chinese Parallel Corpus[J]. New Technology of Library and Information Service, 2013, (6): 36-41. DOI:10.11925/infotech.1003-3513.2013.06.06.