New Technology of Library and Information Service  2013, Vol. Issue (6): 36-41    DOI: 10.11925/infotech.1003-3513.2013.06.06
Sentence Alignment and Re-Alignment for Environmental Protection Texts in English-Chinese Parallel Corpus
Xiong Wenxin
National Research Centre for Foreign Language Education, Beijing Foreign Studies University, Beijing 100089, China
Abstract  Sentence alignment is a crucial step in building a parallel corpus. Many alignment tools are available for constructing language repositories for machine translation systems. In an evaluation of user-friendliness and alignment quality, Champollion outperforms other mainstream open-source tools in aligning English-Chinese parallel texts. Inspired by the “transformation-based error-driven” strategy, the author carries out a thorough linguistic analysis of the erroneous output produced by Champollion and proposes an error-correction strategy that substantially improves precision. Applied as a module to Champollion’s output, the re-alignment approach raises precision from a baseline of 88.74% to 93.91% when aligning English-Chinese texts in the area of environmental protection. This alignment and re-alignment strategy, which combines a statistics-based method with linguistic insights, can be applied to other domains.
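The abstract does not say how precision is scored, so the following is only a minimal sketch, assuming that each aligner output "bead" (a group of English sentences linked to a group of Chinese sentences) counts as correct when it exactly matches a bead in a hand-checked gold standard. The names `bead` and `precision` and the toy data are hypothetical illustrations, not taken from the paper or from Champollion.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# A bead links English sentence indices to Chinese sentence indices.
# Example: ({1}, {1, 2}) is a 1-2 alignment (one English, two Chinese sentences).
Bead = Tuple[FrozenSet[int], FrozenSet[int]]


def bead(en: Iterable[int], zh: Iterable[int]) -> Bead:
    """Build a hashable bead from two groups of sentence indices."""
    return (frozenset(en), frozenset(zh))


def precision(predicted: Set[Bead], gold: Set[Bead]) -> float:
    """Fraction of predicted beads that exactly match a gold-standard bead."""
    if not predicted:
        return 0.0
    return len(predicted & gold) / len(predicted)


# Toy data (made up for illustration only).
gold = {bead([0], [0]), bead([1], [1, 2]), bead([2], [3]), bead([3], [4])}

# Hypothetical aligner output: the 1-2 bead was split incorrectly.
aligner_output = {bead([0], [0]), bead([1], [1]), bead([2], [2, 3]), bead([3], [4])}

# Hypothetical output after a correction pass repairs the two wrong beads.
realigned = {bead([0], [0]), bead([1], [1, 2]), bead([2], [3]), bead([3], [4])}

baseline_precision = precision(aligner_output, gold)
realigned_precision = precision(realigned, gold)
print(f"baseline: {baseline_precision:.2%}, after re-alignment: {realigned_precision:.2%}")
```

Under this assumption, a re-alignment module improves precision whenever it turns mismatched beads into exact matches without breaking beads that were already correct.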
Key words  English-Chinese parallel corpus      Environmental protection text      Sentence alignment      Re-Alignment      Transformation-based error-driven
Received: 11 April 2013      Published: 24 July 2013
CLC Number:  TP361

Cite this article:

Xiong Wenxin. Sentence Alignment and Re-Alignment for Environmental Protection Texts in English-Chinese Parallel Corpus. New Technology of Library and Information Service, 2013, (6): 36-41.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2013.06.06     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2013/V/I6/36
