Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (7): 48-58    DOI: 10.11925/infotech.2096-3467.2021.0033
Sentence Alignment Method Based on BERT and Multi-similarity Fusion
Liu Wenbin, He Yanqing, Wu Zhenfeng, Dong Cheng
Institute of Scientific and Technical Information of China, Beijing 100038, China
[Objective] This paper proposes a method for automatically aligning bilingual sentences, which provides technical support for building bilingual parallel corpora, cross-language information retrieval, and other natural language processing tasks. [Methods] First, we introduced the pre-trained BERT model into sentence alignment and extracted features with a bidirectional Transformer. Second, we represented the semantics of each word with token embeddings, segment embeddings, and position embeddings. Third, we measured similarity in both directions, using the source-language sentence and its translation as well as the target-language sentence and its translation. Finally, we fused the BLEU score, cosine similarity, and Manhattan distance to produce the final sentence alignment. [Results] We evaluated the method in two sets of experiments. In the parallel corpus filtering task, recall reached 97.84%. In the comparable corpus filtering task, precision reached 99.47%, 98.31%, and 95.00% at noise ratios of 20%, 50%, and 90%, respectively. [Limitations] The text representation and the similarity calculation could be further improved by incorporating more semantic information. [Conclusions] The proposed method outperforms the baseline systems in both the parallel corpus filtering and comparable corpus filtering tasks, and can be used to build large-scale, high-quality parallel corpora.
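
The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the weights, the threshold, the `1 / (1 + d)` mapping from Manhattan distance to a similarity, and all function names are assumptions made for illustration. The BLEU score is assumed to be precomputed in the range [0, 1], and the plain lists stand in for BERT sentence embeddings. Only one scoring direction is shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # undefined for zero vectors; treat as "no similarity"
    return dot / (norm_u * norm_v)


def manhattan_similarity(u, v):
    """Map the L1 (Manhattan) distance into (0, 1].

    Assumption: the 1 / (1 + d) mapping is chosen here for illustration;
    the abstract does not state how the distance is normalized.
    """
    dist = sum(abs(a - b) for a, b in zip(u, v))
    return 1.0 / (1.0 + dist)


def fused_score(bleu, src_vec, tgt_vec, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of BLEU, cosine similarity, and Manhattan similarity.

    Assumption: the weights are hypothetical placeholders, not the
    values tuned in the paper.
    """
    w_bleu, w_cos, w_man = weights
    return (
        w_bleu * bleu
        + w_cos * cosine_similarity(src_vec, tgt_vec)
        + w_man * manhattan_similarity(src_vec, tgt_vec)
    )


def is_aligned(bleu, src_vec, tgt_vec, threshold=0.5, weights=(0.4, 0.3, 0.3)):
    """Accept a sentence pair when its fused score reaches the threshold.

    Assumption: the threshold value is a hypothetical placeholder.
    """
    return fused_score(bleu, src_vec, tgt_vec, weights) >= threshold


if __name__ == "__main__":
    # Toy vectors standing in for sentence embeddings.
    print(is_aligned(0.8, [0.2, 0.5, 0.1], [0.2, 0.4, 0.1]))
    print(is_aligned(0.0, [1.0, 0.0], [0.0, 1.0]))
```

In practice the weights and the threshold would be tuned on held-out data, which is what the trend charts listed below appear to report.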

Key words: BERT; Machine Translation; Sentence Alignment; Parallel Corpus; Multi-similarity Fusion
Received: 12 January 2021      Published: 02 April 2021
CLC Number: G351
Fund: Key Project of Institute of Scientific and Technical Information of China (ZD2020-18)
Liu Wenbin, He Yanqing, Wu Zhenfeng, Dong Cheng. Sentence Alignment Method Based on BERT and Multi-similarity Fusion. Data Analysis and Knowledge Discovery, 2021, 5(7): 48-58.

The Overall Framework of This Paper
Transformer Model Based on Self-attention Mechanism
Visual Representation of BERT Embedding
Trend Chart of Indicators Changing with the Weight of BLEU Value
Trend Chart of Each Index Changing with the Weight of Cosine Similarity
Trend Chart of Each Indicator Changing with the Weight of Manhattan Distance
Trend Chart of Indicators Changing with Threshold
Method          Precision/%  Recall/%  F1/%
BLEU            52.47        85.00     64.89
BERT+Mah        10.00        100.00    18.18
BERT+Cos        11.07        98.00     19.90
BERT+Cos+Mah    10.56        98.00     19.07
BERT+BLEU+Cos   21.34        99.00     35.11
BERT+BLEU+Mah   45.54        97.00     61.98
BERT-Multi-Sim  95.00        95.00     95.00
Experimental Results under Different Similarities
Method          Precision/%  Recall/%  F1/%
BLEUalign       100.00       94.60     97.23
Champollion     100.00       95.30     97.59
BERT-Multi-Sim  100.00       97.84     98.91
Parallel Corpus Filtering Effect
Noise ratio  Method          Precision/%  Recall/%  F1/%
20%          BLEUalign       58.76        50.75     54.46
20%          Champollion     78.41        84.00     81.11
20%          BERT-Multi-Sim  99.47        94.38     96.86
50%          BLEUalign       89.53        80.40     84.72
50%          Champollion     55.21        81.60     65.86
50%          BERT-Multi-Sim  98.31        92.80     95.47
90%          BLEUalign       52.47        85.00     64.89
90%          Champollion     13.76        82.00     23.56
90%          BERT-Multi-Sim  95.00        95.00     95.00
Comparable Corpus Filtering Effect
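
The F1 columns in the tables above follow from the reported precision and recall through the standard harmonic-mean formula, F1 = 2PR / (P + R). The short sketch below recomputes a few rows as a sanity check. The function name and the choice of rows are illustrative assumptions; the input figures are copied from the tables.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) pairs copied from the tables above.
rows = {
    "BLEU, 90% noise": (52.47, 85.00),
    "BERT-Multi-Sim, parallel corpus": (100.00, 97.84),
    "BERT-Multi-Sim, 20% noise": (99.47, 94.38),
    "BERT-Multi-Sim, 90% noise": (95.00, 95.00),
}

if __name__ == "__main__":
    for name, (p, r) in rows.items():
        print(f"{name}: F1 = {f1_score(p, r):.2f}%")
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, which is why a method with 100.00% recall but 10.00% precision (BERT+Mah) ends up with an F1 of only 18.18%.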
