New Technology of Library and Information Service  2014, Vol. 30 Issue (12): 36-43    DOI: 10.11925/infotech.1003-3513.2014.12.05
Automatic Acquisition of Domain Parallel Corpora from Internet
Shao Jian, Zhang Chengzhi
School of Economics & Management, Nanjing University of Science & Technology, Nanjing 210094, China
[Objective] To automatically acquire domain-specific parallel corpora by classifying bilingual corpora and aligning their sentences. [Methods] Bilingual corpora are classified using text classification techniques, and the classified corpora are then aligned with a sentence alignment tool that relies on bilingual sentence length information and a bilingual dictionary. Manually aligned bilingual corpora are used to estimate the length parameters. [Results] The sentence alignment accuracy reaches 95.45%. The estimated length mean is 1.7777 and the variance is 1.2640. [Limitations] Because the initial degree of alignment of the bilingual corpus was already satisfactory, the alignment results are not universally representative. [Conclusions] The results show that the proposed method is effective and can acquire high-quality domain parallel corpora.
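
To illustrate how length parameters of this kind can feed a length-based alignment score, the following is a minimal sketch in the style of Gale and Church (1993). It is not the tool used in the paper. It assumes that the reported mean (1.7777) is the expected target length per unit of source length and that the reported variance (1.2640) is the variance per unit of source length; the abstract does not state the direction, the length unit, or the exact formula, so the function names and these interpretations are hypothetical. The bilingual dictionary component is not modelled.

```python
import math

# Values reported in the abstract. Their exact role in the authors' tool
# is an assumption made for this sketch.
LENGTH_MEAN = 1.7777      # assumed: expected target length per unit source length
LENGTH_VARIANCE = 1.2640  # assumed: variance per unit source length


def length_delta(src_len, tgt_len, mean=LENGTH_MEAN, variance=LENGTH_VARIANCE):
    """Standardized length difference, following the Gale-Church formulation."""
    if src_len <= 0:
        raise ValueError("src_len must be positive")
    return (tgt_len - src_len * mean) / math.sqrt(src_len * variance)


def length_cost(src_len, tgt_len, mean=LENGTH_MEAN, variance=LENGTH_VARIANCE):
    """Negative log of the two-sided normal tail probability of the delta.

    A lower cost means the two lengths are more compatible.
    """
    delta = abs(length_delta(src_len, tgt_len, mean, variance))
    # 2 * (1 - Phi(delta)) equals erfc(delta / sqrt(2)) for the standard normal.
    tail = math.erfc(delta / math.sqrt(2))
    # Guard against log(0) for extreme length mismatches.
    return -math.log(max(tail, 1e-300))


if __name__ == "__main__":
    # A pair whose lengths are close to the expected ratio gets a lower cost
    # than a pair whose lengths are far from it.
    print(length_cost(10, 18))
    print(length_cost(10, 40))
```

In a full aligner, a cost of this kind would typically be combined with dictionary evidence and minimized over candidate sentence pairings by dynamic programming.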

Key words: Sentence alignment; Text classification; Parallel corpora; Machine translation
Received: 30 June 2014      Published: 20 January 2015
Shao Jian, Zhang Chengzhi. Automatic Acquisition of Domain Parallel Corpora from Internet. New Technology of Library and Information Service, 2014, 30(12): 36-43.

