Sentence Alignment Method Based on BERT and Multi-similarity Fusion
Liu Wenbin, He Yanqing, Wu Zhenfeng, Dong Cheng
Institute of Scientific and Technical Information of China, Beijing 100038, China
Abstract [Objective] This paper proposes a method for automatically aligning bilingual sentences, aiming to provide technical support for constructing bilingual parallel corpora, cross-language information retrieval, and other natural language processing tasks. [Methods] First, we introduced the pre-trained BERT model into sentence alignment and extracted features with its bidirectional Transformer. Second, we represented word semantics with token embeddings, segment embeddings, and position embeddings. Third, we measured similarity in both directions: between the source-language sentence and its translation, and between the target-language sentence and its translation. Finally, we combined the BLEU score, cosine similarity, and Manhattan distance to produce the final sentence alignment. [Results] We conducted two sets of experiments to evaluate the effectiveness of the new method. In the parallel corpus filtering task, the recall was 97.84%. In the comparable corpus filtering task, the accuracy reached 99.47%, 98.31%, and 95.00% when the noise ratio was 20%, 50%, and 90%, respectively. [Limitations] The text representation and similarity calculation could be further improved by adding more semantic information. [Conclusions] The proposed method outperforms the baseline systems in both the parallel corpus filtering and comparable corpus filtering tasks, and could be used to build large-scale, high-quality parallel corpora.
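The fusion step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything here is an assumption: the equal weights, the `1 / (1 + d)` mapping from Manhattan distance to a similarity, the simplified smoothed BLEU, and the reading of the bidirectional step (the translated source is compared with the target, and the translated target with the source). The function names are hypothetical, and the sentence vectors stand in for BERT-derived representations.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU (assumption: add-one smoothing for n > 1)."""
    if not candidate or not reference:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())  # clipped n-gram matches
        total = sum(cand.values())
        if n == 1:
            if overlap == 0:
                return 0.0  # no shared words at all
            precision = overlap / total
        else:
            precision = (overlap + 1) / (total + 1)
        log_precision += math.log(precision) / max_n
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(log_precision)


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def manhattan_similarity(u, v):
    """Map the Manhattan (L1) distance into (0, 1]; this mapping is an assumption."""
    return 1.0 / (1.0 + sum(abs(a - b) for a, b in zip(u, v)))


def fused_score(tokens_a, tokens_b, vec_a, vec_b, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three similarities (equal weights are an assumption)."""
    scores = (
        sentence_bleu(tokens_a, tokens_b),
        cosine_similarity(vec_a, vec_b),
        manhattan_similarity(vec_a, vec_b),
    )
    return sum(w * s for w, s in zip(weights, scores))


def bidirectional_score(src, tgt, src_translated, tgt_translated):
    """Average the fused score over both directions.

    Each argument is a (tokens, vector) pair. src_translated is the source
    sentence rendered in the target language; tgt_translated is the target
    sentence rendered in the source language.
    """
    forward = fused_score(src_translated[0], tgt[0], src_translated[1], tgt[1])
    backward = fused_score(tgt_translated[0], src[0], tgt_translated[1], src[1])
    return (forward + backward) / 2


if __name__ == "__main__":
    # Toy tokens and 2-dimensional vectors, purely for illustration.
    src = (["s1", "s2"], [1.0, 0.0])
    src_translated = (["the", "cat", "sat"], [1.0, 0.0])
    tgt = (["the", "cat", "sat"], [0.9, 0.1])
    tgt_translated = (["s1", "s2"], [0.9, 0.1])
    print(bidirectional_score(src, tgt, src_translated, tgt_translated))
```

In a filtering setting, a sentence pair would be kept when this score exceeds a threshold tuned on held-out data. The threshold, like the weights, is not specified in the abstract.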
Received: 12 January 2021
Published: 02 April 2021
Fund: Key Project of Institute of Scientific and Technical Information of China (ZD2020-18)
Corresponding Author: He Yanqing, ORCID: 0000-0002-8791-1581, E-mail: heyq@istic.ac.cn