Abstract: Sentence alignment is a crucial step in building a parallel corpus, and many tools are available for constructing language repositories for machine translation systems. In an evaluation of user-friendliness and alignment quality, Champollion outperforms other mainstream open-source tools in aligning English-Chinese parallel texts. Inspired by the “transformation-based error-driven” strategy, the author carries out a thorough linguistic analysis of the erroneous output produced by Champollion and proposes an error correction strategy that improves precision substantially. Attached as a module to Champollion’s output, the realignment approach raises precision from a baseline of 88.74% to 93.91% when aligning English-Chinese texts in the environmental protection domain. This alignment and realignment strategy, which combines a statistics-based method with linguistic insights, can be applied to other domains.
熊文新. 英汉环保领域平行语料的句对齐与再对齐[J]. 现代图书情报技术, 2013, (6): 36-41.
Xiong Wenxin. Sentence Alignment and Re-Alignment for Environmental Protection Texts in English-Chinese Parallel Corpus. New Technology of Library and Information Service, 2013, (6): 36-41.
[1] Koehn P. Europarl: A Parallel Corpus for Statistical Machine Translation[C]. In: Proceedings of the 10th Machine Translation Summit, Phuket Island, Thailand. 2005, 5: 79-86.
[2] Dandapat S, Morrissey S, Naskar S K, et al. Statistically Motivated Example-based Machine Translation Using Translation Memory[C]. In: Proceedings of the 8th International Conference on Natural Language Processing, Kharagpur, India. 2010: 168-177.
[3] 王克非, 熊文新. 汉英对应语料库的检索及应用[J]. 外语电化教学, 2011(6): 31-36. (Wang Kefei, Xiong Wenxin. Design and Application of Sentence Pair Retrieval from Parallel Corpora for Translation Studies and Translation Teaching[J]. Media in Foreign Language Instruction, 2011(6): 31-36.)
[4] McEnery A, Xiao Z. Parallel and Comparable Corpora: What Are They Up To?[A].// Anderman G M, Rogers M A. Incorporating Corpora: The Linguist and the Translator[M]. Clevedon: Multilingual Matters, 2007.
[5] Brown P F, Lai J C, Mercer R L. Aligning Sentences in Parallel Corpora[C]. In: Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL’91). Stroudsburg: Association for Computational Linguistics, 1991: 169-176.
[6] Gale W A, Church K W. A Program for Aligning Sentences in Bilingual Corpora[C]. In: Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL’91). Stroudsburg: Association for Computational Linguistics, 1991: 177-184.
[7] Church K W. Char_align: A Program for Aligning Parallel Texts at the Character Level[C]. In: Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL’93), Columbus, OH, USA. Stroudsburg: Association for Computational Linguistics, 1993: 1-8.
[8] Och F J, Ney H. A Systematic Comparison of Various Statistical Alignment Models[J]. Computational Linguistics, 2003, 29(1): 19-51.
[9] Utsuro T, Ikeda H, Yamane M, et al. Bilingual Text Matching Using Bilingual Dictionary and Statistics[C]. In: Proceedings of the 15th International Conference on Computational Linguistics (COLING’94), Kyoto, Japan. Stroudsburg: Association for Computational Linguistics, 1994, 2: 1076-1082.
[10] Simard M, Foster G F, Isabelle P. Using Cognates to Align Sentences in Bilingual Corpora[C]. In: Proceedings of the 1993 Conference of the Centre for Advanced Studies on Collaborative Research: Distributed Computing (CASCON’93). IBM Press, 1993, 2: 1071-1082.
[11] Brill E. Transformation-based Error-driven Learning and Natural Language Processing: A Case Study in Part-of-speech Tagging[J]. Computational Linguistics, 1995, 21(4): 543-565.
[12] Harris Z. Language and Information[M]. New York: Columbia University Press, 1988.
[13] 熊文新. Web、语料库与双语平行语料库的建设[J]. 图书情报工作, 2013, 57(10): 128-135. (Xiong Wenxin. Web, Corpus and the Building of Bilingual Parallel Corpus[J]. Library and Information Service, 2013, 57(10): 128-135.)
[14] Ma X. Champollion: A Robust Parallel Text Sentence Aligner[C]. In: Proceedings of the 5th International Conference on Language Resources and Evaluation, Genoa, Italy. 2006: 489-492.
[15] Varga D, Nemeth L, Halacsy P, et al. Parallel Corpora for Medium Density Languages[C]. In: Proceedings of Recent Advances in Natural Language Processing (RANLP’05), Borovets, Bulgaria. 2005: 590-596.
[16] Moore R C. Fast and Accurate Sentence Alignment of Bilingual Corpora[C]. In: Proceedings of Machine Translation: From Research to Real Users. Springer, 2002, 2499: 135-144.
[17] Li P, Sun M, Xue P. Fast-Champollion: A Fast and Robust Sentence Alignment Algorithm[C]. In: Proceedings of the 23rd International Conference on Computational Linguistics, Beijing, China. 2010: 710-718.