Data Analysis and Knowledge Discovery
Research on sentence alignment based on BERT and multi-similarity fusion
Liu Wenbin, He Yanqing, Wu Zhenfeng, Dong Chen
(Institute of Scientific and Technical Information of China, Beijing 100038, China)
[Objective]Sentence alignment technology aims to provide large-scale and high-quality parallel sentence pairs for cross-language natural language processing tasks.

[Methods]BERT pre-training is introduced into sentence alignment, with features extracted by a bidirectional Transformer. Each word is represented by three kinds of embeddings: position embeddings, token embeddings, and segment embeddings. The three embeddings are summed to form the final word vector, which represents the semantic information of the word. The source-language sentence with its translation and the target-language sentence with its translation are compared bidirectionally, and the BLEU score, cosine similarity, and Manhattan distance are combined to obtain the final sentence alignment.
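
The fusion step in the Methods can be illustrated with a minimal Python sketch. It is not the paper's implementation: the equal weights, the 1/(1+d) mapping of Manhattan distance to a similarity, the averaging of the two translation directions, the unigram-overlap stand-in for BLEU, and all function names are assumptions made for illustration, since the abstract does not specify how the three scores are combined. Sentence vectors are assumed to come from a BERT encoder and are passed in as plain lists of floats.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def manhattan_similarity(u, v):
    """Map the L1 distance into (0, 1]; identical vectors score 1.
    The 1 / (1 + d) mapping is an assumption, not taken from the paper."""
    distance = sum(abs(a - b) for a, b in zip(u, v))
    return 1.0 / (1.0 + distance)


def unigram_overlap(candidate, reference):
    """Simplified stand-in for BLEU: clipped unigram precision only
    (no higher-order n-grams, no brevity penalty)."""
    cand_tokens = candidate.split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.split())
    cand_counts = Counter(cand_tokens)
    clipped = sum(min(c, ref_counts[t]) for t, c in cand_counts.items())
    return clipped / len(cand_tokens)


def fused_score(overlap, cos, man, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three similarity signals (weights are hypothetical)."""
    w_overlap, w_cos, w_man = weights
    return w_overlap * overlap + w_cos * cos + w_man * man


def bidirectional_score(forward, backward):
    """Average the fused scores of the two translation directions (an assumption)."""
    return 0.5 * (forward + backward)
```

A candidate sentence pair would then be accepted as aligned when `bidirectional_score` exceeds a threshold tuned on held-out data; the threshold is likewise a hypothetical parameter.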

[Results]Two tasks were used to verify the effectiveness of the method. In the parallel corpus filtering task, the recall rate was 97.84%; in the comparable corpus filtering task, the accuracy rate was 99.47%, 98.31%, and 95% at noise ratios of 20%, 50%, and 90%, respectively.

[Limitations]The methods of text representation and similarity calculation need further improvement to obtain more semantic information.

[Conclusions]The proposed method far outperforms the baseline system in both the parallel corpus filtering task and the comparable corpus filtering task, and can therefore be used to obtain large-scale, high-quality parallel corpora.

Key words: BERT; Machine Translation; Sentence Alignment; Parallel Corpus; Multi-Similarity Fusion
Published: 02 April 2021
Liu Wenbin, He Yanqing, Wu Zhenfeng, Dong Chen. Research on sentence alignment based on BERT and multi-similarity fusion. Data Analysis and Knowledge Discovery, 0, (): 1-.

