[Objective] This paper proposes a method for automatically aligning bilingual sentences, aiming to provide technical support for constructing bilingual parallel corpora, cross-language information retrieval, and other natural language processing tasks. [Methods] First, we introduced BERT pre-training into sentence alignment and extracted features with a bidirectional Transformer. Second, we represented the semantics of words with position embeddings, token embeddings, and segment embeddings. Third, we measured similarity in both directions, comparing the source-language sentence and its translation as well as the target-language sentence and its translation. Finally, we combined the BLEU score, cosine similarity, and Manhattan distance to produce the final sentence alignment. [Results] We conducted two rounds of tests to evaluate the effectiveness of the new method. In the parallel corpus filtering task, the recall was 97.84%. In the comparable corpus filtering task, the accuracy reached 99.47%, 98.31%, and 95.00% at noise ratios of 20%, 50%, and 90%, respectively. [Limitations] The text representation and similarity calculation could be further improved by adding more semantic information. [Conclusions] The proposed method outperforms the baseline systems in both the parallel corpus filtering and comparable corpus filtering tasks, and could be used to generate large-scale, high-quality parallel corpora.
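The fusion step described under [Methods] can be illustrated with a minimal sketch. It assumes that sentence vectors (for example, pooled BERT outputs) and a BLEU score in [0, 1] have already been computed elsewhere. The function names, the equal weights, the `1 / (1 + d)` mapping from Manhattan distance to a similarity, and the averaging of the two directions are all illustrative assumptions, not the formulation used in the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def manhattan_distance(u, v):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def manhattan_similarity(u, v):
    # Assumption: turn the distance into a similarity in (0, 1]
    # so that it can be mixed with BLEU and cosine similarity.
    return 1.0 / (1.0 + manhattan_distance(u, v))


def fused_score(bleu, u, v, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of BLEU, cosine similarity, and Manhattan similarity.

    The equal default weights are hypothetical, not taken from the paper.
    """
    w_bleu, w_cos, w_man = weights
    return (
        w_bleu * bleu
        + w_cos * cosine_similarity(u, v)
        + w_man * manhattan_similarity(u, v)
    )


def bidirectional_score(forward, backward):
    # Assumption: the two directional scores are simply averaged.
    return (forward + backward) / 2.0
```

A candidate sentence pair could then be accepted as aligned when `bidirectional_score` exceeds a threshold tuned on held-out data. The threshold is likewise an assumption of this sketch.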
刘文斌, 何彦青, 吴振峰, 董诚. 基于BERT和多相似度融合的句子对齐方法研究*[J]. 数据分析与知识发现, 2021, 5(7): 48-58.
Liu Wenbin, He Yanqing, Wu Zhenfeng, Dong Cheng. Sentence Alignment Method Based on BERT and Multi-similarity Fusion. Data Analysis and Knowledge Discovery, 2021, 5(7): 48-58.