Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (7): 48-58    DOI: 10.11925/infotech.2096-3467.2021.0033
Sentence Alignment Method Based on BERT and Multi-similarity Fusion
Liu Wenbin, He Yanqing, Wu Zhenfeng, Dong Cheng
Institute of Scientific and Technical Information of China, Beijing 100038, China
Abstract  

[Objective] This paper proposes a method for automatically aligning bilingual sentences, aiming to provide technical support for constructing bilingual parallel corpora, cross-language information retrieval, and other natural language processing tasks. [Methods] First, we introduced the pre-trained BERT model into sentence alignment and extracted features with its bidirectional Transformer. Then, we represented word semantics with position embeddings, token embeddings, and segment embeddings. Third, we measured similarity bidirectionally, comparing the source-language sentence with its translation and the target-language sentence with its translation. Finally, we combined the BLEU score, cosine similarity, and Manhattan distance to produce the final sentence alignment. [Results] We evaluated the method in two experiments. In the parallel corpus filtering task, the recall was 97.84%. In the comparable corpus filtering task, the precision reached 99.47%, 98.31%, and 95.00% at noise ratios of 20%, 50%, and 90%, respectively. [Limitations] The text representation and similarity calculation could be further improved by adding more semantic information. [Conclusions] The proposed method outperforms the baseline systems in both parallel corpus filtering and comparable corpus filtering, and could be used to build large-scale, high-quality parallel corpora.
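
The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the weighted-sum form, the weights, the mapping from Manhattan distance to a similarity, and the simplified BLEU (up to bigrams) are all assumptions made for illustration, and the vectors stand in for sentence embeddings that would come from BERT in practice.

```python
import math
from collections import Counter


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate token list against a reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def simple_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions multiplied by the brevity penalty."""
    precisions = [ngram_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_mean)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def manhattan_similarity(u, v):
    """Map Manhattan (L1) distance into (0, 1]; this mapping is an assumption."""
    return 1.0 / (1.0 + sum(abs(a - b) for a, b in zip(u, v)))


def fused_score(bleu, cos, man, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of the three similarities; the weights are hypothetical."""
    w_bleu, w_cos, w_man = weights
    return w_bleu * bleu + w_cos * cos + w_man * man


# Toy inputs: a machine translation of the source compared with the target sentence,
# plus placeholder vectors standing in for the two sentence embeddings.
translation = "the cat sat on the mat".split()
target = "the cat sat on a mat".split()
src_vec, tgt_vec = [0.2, 0.7, 0.1], [0.25, 0.65, 0.1]

score = fused_score(
    simple_bleu(translation, target),
    cosine_similarity(src_vec, tgt_vec),
    manhattan_similarity(src_vec, tgt_vec),
)
print(f"fused similarity: {score:.3f}")
```

A sentence pair would then be kept as aligned when the fused score exceeds a chosen threshold. For the bidirectional measurement, one option is to compute the score once per translation direction and combine the two values.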

Key words: BERT; Machine Translation; Sentence Alignment; Parallel Corpus; Multi-similarity Fusion
Received: 12 January 2021      Published: 02 April 2021
CLC Number: G351
Fund: Key Project of the Institute of Scientific and Technical Information of China (ZD2020-18)
Corresponding Author: He Yanqing, ORCID: 0000-0002-8791-1581, E-mail: heyq@istic.ac.cn

Cite this article:

Liu Wenbin, He Yanqing, Wu Zhenfeng, Dong Cheng. Sentence Alignment Method Based on BERT and Multi-similarity Fusion. Data Analysis and Knowledge Discovery, 2021, 5(7): 48-58.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0033     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I7/48

The Overall Framework of This Paper
Transformer Model Based on Self-attention Mechanism
Visual Representation of BERT Embedding
Trend Chart of Indicators Changing with the Weight of BLEU Value
Trend Chart of Each Index Changing with the Weight of Cosine Similarity
Trend Chart of Each Indicator Changing with the Weight of Manhattan Distance
Trend Chart of Indicators Changing with Threshold
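
The kind of threshold analysis named in the last chart title can be sketched as follows. The scored pairs and thresholds are toy values invented for illustration, not data from the paper, and the acceptance rule (score at or above the threshold) is an assumption.

```python
def sweep_thresholds(scored_pairs, thresholds):
    """Return (threshold, precision, recall) for each threshold.

    scored_pairs: list of (score, is_parallel) tuples.
    A pair is accepted as aligned when its score is >= the threshold.
    """
    total_parallel = sum(1 for _, is_parallel in scored_pairs if is_parallel)
    rows = []
    for t in thresholds:
        accepted = [is_parallel for score, is_parallel in scored_pairs if score >= t]
        true_pos = sum(accepted)
        precision = true_pos / len(accepted) if accepted else 0.0
        recall = true_pos / total_parallel if total_parallel else 0.0
        rows.append((t, precision, recall))
    return rows


# Toy scored sentence pairs: (fused similarity, truly parallel?)
toy_pairs = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
for threshold, precision, recall in sweep_thresholds(toy_pairs, [0.3, 0.5, 0.7]):
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold typically trades recall for precision, which is the pattern such a chart is meant to expose.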
Method            Precision/%   Recall/%   F1/%
BLEU              52.47         85.00      64.89
BERT+Mah          10.00         100.00     18.18
BERT+Cos          11.07         98.00      19.90
BERT+Cos+Mah      10.56         98.00      19.07
BERT+BLEU+Cos     21.34         99.00      35.11
BERT+BLEU+Mah     45.54         97.00      61.98
BERT-Multi-Sim    95.00         95.00      95.00
Experimental Results under Different Similarities
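
The F1 column in the table above is the harmonic mean of precision and recall. The short check below recomputes it for three rows copied from that table. The function name is illustrative, and small differences of about 0.01 can occur in other rows because the published precision and recall are themselves rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (method, precision/%, recall/%, reported F1/%) copied from the table above
rows = [
    ("BLEU", 52.47, 85.00, 64.89),
    ("BERT+BLEU+Mah", 45.54, 97.00, 61.98),
    ("BERT-Multi-Sim", 95.00, 95.00, 95.00),
]
for method, p, r, reported in rows:
    print(f"{method}: computed {f1_score(p, r):.2f}, reported {reported:.2f}")
```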
Method            Precision/%   Recall/%   F1/%
BLEUalign         100.00        94.60      97.23
Champollion       100.00        95.30      97.59
BERT-Multi-Sim    100.00        97.84      98.91
Parallel Corpus Filtering Effect
Noise ratio   Method            Precision/%   Recall/%   F1/%
20%           BLEUalign         58.76         50.75      54.46
20%           Champollion       78.41         84.00      81.11
20%           BERT-Multi-Sim    99.47         94.38      96.86
50%           BLEUalign         89.53         80.40      84.72
50%           Champollion       55.21         81.60      65.86
50%           BERT-Multi-Sim    98.31         92.80      95.47
90%           BLEUalign         52.47         85.00      64.89
90%           Champollion       13.76         82.00      23.56
90%           BERT-Multi-Sim    95.00         95.00      95.00
Comparable Corpus Filtering Effect