Data Analysis and Knowledge Discovery
Research on sentence alignment based on BERT and multi-similarity fusion
Liu Wenbin, He Yanqing, Wu Zhenfeng, Dong Chen
(Institute of Scientific and Technical Information of China, Beijing 100038, China)
Abstract  

[Objective] Sentence alignment aims to provide large-scale, high-quality parallel sentence pairs for cross-language natural language processing tasks.

[Methods] This paper introduces BERT pre-training into sentence alignment, extracting features with a bidirectional Transformer. Each word is represented by three kinds of embeddings: token embeddings, segment embeddings, and position embeddings. The three embeddings are summed to form the final word vector, which represents the semantic information of the word. The source-language sentence with its translation and the target-language sentence with its translation are compared in both directions, and the BLEU score, cosine similarity, and Manhattan distance are combined to obtain the final sentence alignment.
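
The fusion step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the fusion formula, so the equal weights, the mapping of Manhattan distance to a similarity via 1 / (1 + d), the simplified smoothed BLEU, and all function names (`simple_bleu`, `fused_score`, and so on) are assumptions made for illustration. The sketch also assumes that sentence vectors (for example, pooled BERT outputs) are already available as plain lists of floats in a shared space, and that "bidirectional" means scoring the translated source against the target and the translated target against the source.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def manhattan_similarity(u, v):
    """Map the Manhattan (L1) distance into (0, 1]; 1.0 means identical vectors.

    The 1 / (1 + d) mapping is an assumption, not taken from the paper.
    """
    distance = sum(abs(a - b) for a, b in zip(u, v))
    return 1.0 / (1.0 + distance)


def simple_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU with add-one smoothing and brevity penalty."""
    if not candidate or not reference:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        # Clipped n-gram matches: each candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(log_precision)


def bidirectional_bleu(src_tokens, src_translated, tgt_tokens, tgt_translated):
    """Average BLEU over both directions (assumed reading of 'bi-directionally')."""
    forward = simple_bleu(src_translated, tgt_tokens)   # translated source vs. target
    backward = simple_bleu(tgt_translated, src_tokens)  # translated target vs. source
    return (forward + backward) / 2.0


def fused_score(src_vec, tgt_vec, src_tokens, src_translated,
                tgt_tokens, tgt_translated, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of BLEU, cosine, and Manhattan-based similarity (hypothetical weights)."""
    w_bleu, w_cos, w_man = weights
    bleu = bidirectional_bleu(src_tokens, src_translated, tgt_tokens, tgt_translated)
    return (w_bleu * bleu
            + w_cos * cosine_similarity(src_vec, tgt_vec)
            + w_man * manhattan_similarity(src_vec, tgt_vec))
```

A candidate pair would then be kept as aligned when `fused_score(...)` exceeds a threshold tuned on held-out data; the threshold, like the weights, is a free parameter of this sketch rather than a value reported in the abstract.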

[Results] Two tasks were used to verify the effectiveness of the method. In the parallel corpus filtering task, the recall is 97.84%; in the comparable corpus filtering task, the accuracy is 99.47%, 98.31%, and 95% when the noise ratio is 20%, 50%, and 90%, respectively.
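
As a rough illustration of how figures of this kind can be computed, the sketch below evaluates a threshold-based filter on labeled candidate pairs. The abstract does not define its metrics, so the definitions used here are assumptions: recall is the share of truly parallel pairs that the filter keeps, accuracy is the share of pairs whose keep/discard decision matches the label, and the noise ratio is the share of non-parallel pairs in the candidate set. The function names and the threshold are hypothetical.

```python
def noise_ratio(labels):
    """Fraction of candidate pairs that are not truly parallel."""
    return sum(1 for is_parallel in labels if not is_parallel) / len(labels)


def evaluate_filter(scores, labels, threshold):
    """Return (recall, accuracy) for keeping pairs with score >= threshold.

    scores: one similarity score per candidate pair.
    labels: True where the pair is truly parallel, False where it is noise.
    """
    kept = [score >= threshold for score in scores]
    true_positives = sum(1 for k, l in zip(kept, labels) if k and l)
    total_parallel = sum(1 for l in labels if l)
    correct = sum(1 for k, l in zip(kept, labels) if k == l)
    recall = true_positives / total_parallel if total_parallel else 0.0
    accuracy = correct / len(labels) if labels else 0.0
    return recall, accuracy
```

Running the same filter on candidate sets built with different noise ratios, as in the comparable corpus task, shows how robust the scoring is as the share of non-parallel pairs grows.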

[Limitations] The text representation and similarity calculation methods need further improvement to capture more semantic information.

[Conclusions] The proposed method is far superior to the baseline system in both the parallel corpus filtering task and the comparable corpus filtering task, so it can be used to obtain large-scale, high-quality parallel corpora.


Key words: BERT; Machine Translation; Sentence Alignment; Parallel Corpus; Multi-Similarity Fusion
Published: 02 April 2021
ZTFLH:  G351  

Cite this article:

Liu Wenbin, He Yanqing, Wu Zhenfeng, Dong Chen. Research on sentence alignment based on BERT and multi-similarity fusion . Data Analysis and Knowledge Discovery, 0, (): 1-.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0033     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y0/V/I/1
