New Technology of Library and Information Service  2009, Vol. 25 Issue (12): 47-51    DOI: 10.11925/infotech.1003-3513.2009.12.09
Automatic Building of Sentence-Level English-Chinese Parallel Corpus
Wang Dongbo   Su Xinning
(Department of Information Management, Nanjing University, Nanjing 210093, China)
This article describes the steps for automatically building a large-scale, sentence-level English-Chinese parallel corpus from websites. Specifically, it addresses the following: criteria for selecting the websites to grab are set and a word library is compiled; the websites are grabbed automatically with the tool ‘Wget’; the English-Chinese parallel sentences extracted from the websites are then processed, and the Chinese sentences are segmented using Conditional Random Fields. The resulting corpus contains 1 017 963 English-Chinese parallel sentences, automatically extracted from 675 308 websites and stored in a database.

Key words: English-Chinese parallel corpus; Wget; Word library; Conditional random field
Received: 30 November 2009      Published: 25 December 2009


About authors: Wang Dongbo, Su Xinning

Wang Dongbo, Su Xinning. Automatic Building of Sentence-Level English-Chinese Parallel Corpus. New Technology of Library and Information Service, 2009, 25(12): 47-51.

