[Objective] To automatically obtain domain-specific parallel corpora by classifying bilingual corpora and aligning their sentences. [Methods] Bilingual corpora are classified by domain using text classification, and the classified corpora are then aligned with a sentence alignment tool that relies on bilingual sentence length information and a bilingual dictionary. Manually aligned bilingual corpora are used to estimate the length parameters. [Results] 95.45% of sentences are aligned correctly. The estimated length mean is 1.7777 and the variance is 1.2640. [Limitations] Because the initial degree of alignment of the bilingual corpora was already satisfactory, the alignment results are not universally representative. [Conclusions] The results show that the proposed method is effective and can acquire high-quality domain parallel corpora.
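As an illustration of how length parameters of this kind might be estimated and used, the following is a minimal sketch in the style of Gale and Church's length-based alignment. It assumes that the reported mean (1.7777) plays the role of the expected target-to-source length ratio c and that the reported variance (1.2640) plays the role of s²; the abstract does not state the length unit or the direction of the ratio. The function names and the simple moment estimator for s² are hypothetical choices made for this sketch, not the paper's implementation.

```python
import math


def estimate_length_params(pairs):
    """Estimate (c, s2) from manually aligned (source_len, target_len) pairs.

    c  : total target length divided by total source length.
    s2 : average of (target_len - c * source_len)^2 / source_len,
         a simple moment estimate (an assumption of this sketch).
    """
    pairs = list(pairs)
    total_src = sum(src for src, _ in pairs)
    total_tgt = sum(tgt for _, tgt in pairs)
    c = total_tgt / total_src
    s2 = sum((tgt - c * src) ** 2 / src for src, tgt in pairs) / len(pairs)
    return c, s2


def length_delta(src_len, tgt_len, c, s2):
    """Normalized length deviation of a candidate sentence pair."""
    return (tgt_len - c * src_len) / math.sqrt(src_len * s2)


def match_cost(src_len, tgt_len, c, s2):
    """Negative log of the two-tailed normal probability of the deviation.

    Lower cost means the two lengths are more consistent with a translation.
    """
    delta = length_delta(src_len, tgt_len, c, s2)
    # erfc(|d| / sqrt(2)) equals 2 * (1 - Phi(|d|)) for a standard normal.
    prob = math.erfc(abs(delta) / math.sqrt(2))
    return -math.log(max(prob, 1e-300))  # floor avoids log(0)


if __name__ == "__main__":
    # Toy lengths, invented for illustration only.
    aligned = [(10, 18), (20, 35), (5, 9)]
    c, s2 = estimate_length_params(aligned)
    print("estimated c and s2:", c, s2)
    # Score two candidates with the parameters reported in the abstract.
    print("plausible pair cost:", match_cost(10, 18, 1.7777, 1.2640))
    print("implausible pair cost:", match_cost(10, 40, 1.7777, 1.2640))
```

In a full aligner, a cost of this kind would typically be combined with dictionary-based evidence and minimized over candidate alignments by dynamic programming; the sketch covers only the length component.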
邵健, 章成志. 从互联网上自动获取领域平行语料[J]. 现代图书情报技术, 2014, 30(12): 36-43.
Shao Jian, Zhang Chengzhi. Automatic Acquisition of Domain Parallel Corpora from Internet. New Technology of Library and Information Service, 2014, 30(12): 36-43.