Sentence Alignment and Re-Alignment for Environmental Protection Texts in English-Chinese Parallel Corpus

doi:10.11925/infotech.1003-3513.2013.06.06

New Technology of Library and Information Service

2013, Issue (6): 36-41 https://doi.org/10.11925/infotech.1003-3513.2013.06.06

Knowledge Organization and Knowledge Management


Sentence Alignment and Re-Alignment for Environmental Protection Texts in English-Chinese Parallel Corpus

Xiong Wenxin

National Research Centre for Foreign Language Education, Beijing Foreign Studies University, Beijing 100089, China

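Abstract (Chinese version): From the perspective of resource construction, this paper compares existing statistics-based sentence alignment tools in terms of usability and performance, and finds Champollion comparatively well suited to English-Chinese sentence alignment. Borrowing the “transformation-based error-driven” idea, it applies linguistic rules to re-align Champollion's erroneous alignments, which further improves sentence alignment quality. On specialized English-Chinese texts in the environmental protection domain, sentence alignment precision rises from an initial 88.74% to 93.91%. This method of alignment followed by re-alignment, which combines a statistics-based alignment tool with linguistic knowledge, is generally applicable to building large-scale bilingual corpora “step by step and domain by domain”.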

Abstract: Sentence alignment is a crucial step in building a parallel corpus. Many sentence alignment tools are available for constructing language resources for machine translation systems. An evaluation of usability and alignment quality shows that Champollion outperforms other mainstream open-source tools in aligning English-Chinese parallel texts. Inspired by the “transformation-based error-driven” strategy, the author carries out a linguistic analysis of the alignment errors produced by Champollion and proposes an error-correction strategy that further improves precision. Applied as a module to Champollion's output, the re-alignment approach raises precision from a baseline of 88.74% to 93.91% on English-Chinese texts in the environmental protection domain. This alignment and re-alignment strategy, which combines a statistics-based method with linguistic insights, can be applied to other domains.

Key words: English-Chinese parallel corpus, Environmental protection text, Sentence alignment, Re-alignment, Transformation-based error-driven
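The re-alignment idea in the abstract (take the output of a statistical aligner, detect suspicious alignments, and repair them with rules) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not list the actual linguistic rules, so the numeral-matching rule here is a hypothetical stand-in, and the `Bead` structure is an assumed in-memory format, not Champollion's output format.

```python
import re
from typing import List, Set, Tuple

# Assumed format: one alignment "bead" = (English sentences, Chinese sentences).
Bead = Tuple[List[str], List[str]]


def numbers(sentences: List[str]) -> Set[str]:
    """Collect Arabic-numeral tokens, which usually survive translation."""
    return set(re.findall(r"\d+(?:\.\d+)?", " ".join(sentences)))


def is_suspect(bead: Bead) -> bool:
    """Hypothetical rule: both sides should mention the same numerals."""
    english, chinese = bead
    return numbers(english) != numbers(chinese)


def merge(a: Bead, b: Bead) -> Bead:
    """Join two adjacent beads into one larger bead."""
    return (a[0] + b[0], a[1] + b[1])


def realign(beads: List[Bead]) -> List[Bead]:
    """Merge a suspect bead with its successor when that repairs the rule."""
    result: List[Bead] = []
    i = 0
    while i < len(beads):
        current = beads[i]
        if is_suspect(current) and i + 1 < len(beads):
            merged = merge(current, beads[i + 1])
            if not is_suspect(merged):
                result.append(merged)
                i += 2  # both beads were consumed by the merge
                continue
        result.append(current)
        i += 1
    return result


def precision(predicted: List[Bead], gold: List[Bead]) -> float:
    """Share of predicted beads that also appear in the gold alignment."""
    if not predicted:
        return 0.0
    return sum(1 for bead in predicted if bead in gold) / len(predicted)
```

Comparing `precision` before and after `realign` against a hand-checked gold alignment is the kind of measurement behind the two figures quoted in the abstract, although the paper's own evaluation procedure may differ.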

Received: 2013-04-11 Published: 2013-07-24

TP361

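Funding: This work is one of the outcomes of the Ministry of Education Humanities and Social Sciences Research Project “A Study of Idiosyncratic Combinations in English Based on Corpora and Corresponding Word Lists” (No. 09YJA740013), the National Social Science Fund of China project “Natural Language in the Service of Information Retrieval” (No. 11BYY051), and the Ministry of Education Program for New Century Excellent Talents (No. NCET-11-0591).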

Corresponding author: Xiong Wenxin E-mail: xiongwenxin@bfsu.edu.cn

Cite this article:

Xiong Wenxin. Sentence Alignment and Re-Alignment for Environmental Protection Texts in English-Chinese Parallel Corpus. New Technology of Library and Information Service, 2013, (6): 36-41.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2013.06.06 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2013/V/I6/36

