Data Analysis and Knowledge Discovery, 2021, Vol. 5, Issue (7): 48-58     https://doi.org/10.11925/infotech.2096-3467.2021.0033
Research Paper
Sentence Alignment Method Based on BERT and Multi-similarity Fusion
Liu Wenbin,He Yanqing(),Wu Zhenfeng,Dong Cheng
Institute of Scientific and Technical Information of China, Beijing 100038, China
Abstract

[Objective] This paper proposes a method for automatically aligning bilingual sentences, aiming to provide technical support for constructing bilingual parallel corpora, cross-language information retrieval, and other natural language processing tasks. [Methods] First, we introduced BERT pre-training into the sentence alignment method and extracted features with a bidirectional Transformer. Then, we represented each word's semantics as the sum of its position, token, and segment embeddings. Third, we bi-directionally measured the source-language sentence against its translation, as well as the target-language sentence against its translation. Finally, we fused the BLEU score, cosine similarity, and Manhattan distance to produce the final sentence alignment. [Results] We conducted two tasks to evaluate the effectiveness of the method. In the parallel corpus filtering task, the recall was 97.84%. In the comparable corpus filtering task, the precision reached 99.47%, 98.31%, and 95.00% when the noise ratio was 20%, 50%, and 90%, respectively. [Limitations] The text representation and similarity calculation could be further improved with richer semantic information. [Conclusions] The proposed method outperforms the baseline systems in both the parallel corpus filtering and comparable corpus filtering tasks, and can generate large-scale, high-quality parallel corpora.
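The input representation described above follows BERT's convention: each token's vector is the element-wise sum of its token, position, and segment embeddings. A minimal sketch with toy random lookup tables (the dimensions and tables are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, purely for illustration.
vocab_size, max_len, n_segments, d_model = 100, 16, 2, 8

# Three lookup tables, as in BERT: token, position, and segment embeddings.
token_emb = rng.normal(size=(vocab_size, d_model))
pos_emb = rng.normal(size=(max_len, d_model))
seg_emb = rng.normal(size=(n_segments, d_model))

def embed(token_ids, segment_ids):
    """Each token's input vector is the element-wise sum of its
    token, position, and segment embeddings."""
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + pos_emb[positions] + seg_emb[segment_ids]

x = embed([5, 17, 42], [0, 0, 1])
print(x.shape)  # (3, 8): one d_model-dimensional vector per token
```

The summed vectors are then fed to the bidirectional Transformer encoder, which produces the contextual features used for similarity measurement.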

Key words: BERT; Machine Translation; Sentence Alignment; Parallel Corpus; Multi-similarity Fusion
Received: 2021-01-12      Online: 2021-04-02
CLC number: G351
Funding: Key Project of the Institute of Scientific and Technical Information of China (ZD2020-18)
Corresponding author: He Yanqing, ORCID: 0000-0002-8791-1581, E-mail: heyq@istic.ac.cn
Cite this article:
Liu Wenbin, He Yanqing, Wu Zhenfeng, Dong Cheng. Sentence Alignment Method Based on BERT and Multi-similarity Fusion. Data Analysis and Knowledge Discovery, 2021, 5(7): 48-58.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0033      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I7/48
Fig.1  Overall framework of this study
Fig.2  Structure of the Transformer model based on the self-attention mechanism
Fig.3  Visualization of BERT embeddings
Fig.4  Trends of the metrics as the BLEU score weight varies
Fig.5  Trends of the metrics as the cosine similarity weight varies
Fig.6  Trends of the metrics as the Manhattan distance weight varies
Fig.7  Trends of the metrics as the threshold varies
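The weighted fusion of the three similarity signals swept in Figs. 4-6 can be sketched as follows. The weights, the mapping of Manhattan distance to a similarity in (0, 1], and the alignment threshold are illustrative assumptions, not the paper's tuned values:

```python
import numpy as np

def cosine_sim(u, v):
    # Standard cosine similarity between two sentence vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def manhattan_sim(u, v):
    # Map Manhattan (L1) distance into (0, 1]: closer vectors score higher.
    return 1.0 / (1.0 + float(np.abs(u - v).sum()))

def fusion_score(bleu, src_vec, trans_vec, weights=(0.4, 0.3, 0.3)):
    """Weighted fusion of BLEU score, cosine similarity, and
    Manhattan-based similarity (weights are placeholders)."""
    w_bleu, w_cos, w_mah = weights
    return (w_bleu * bleu
            + w_cos * cosine_sim(src_vec, trans_vec)
            + w_mah * manhattan_sim(src_vec, trans_vec))

def is_aligned(score_fwd, score_bwd, threshold=0.5):
    # Bidirectional measurement: a pair is kept only if the
    # source->translation and target->translation scores both pass.
    return score_fwd >= threshold and score_bwd >= threshold

u = np.array([0.2, 0.5, 0.3])
print(round(fusion_score(0.8, u, u), 2))  # identical vectors: 0.4*0.8 + 0.3 + 0.3 = 0.92
```

In this sketch the two directions are scored independently and combined by a joint threshold; the paper's exact decision rule and weight settings are those explored in Figs. 4-7.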
Method            Precision/%   Recall/%   F1/%
BLEU                    52.47      85.00   64.89
BERT+Mah                10.00     100.00   18.18
BERT+Cos                11.07      98.00   19.90
BERT+Cos+Mah            10.56      98.00   19.07
BERT+BLEU+Cos           21.34      99.00   35.11
BERT+BLEU+Mah           45.54      97.00   61.98
BERT-Multi-Sim          95.00      95.00   95.00
Table 1  Results of different similarity methods
Method            Precision/%   Recall/%   F1/%
BLEUalign              100.00      94.60   97.23
Champollion            100.00      95.30   97.59
BERT-Multi-Sim         100.00      97.84   98.91
Table 2  Parallel corpus filtering results
Noise ratio   Method            Precision/%   Recall/%   F1/%
20%           BLEUalign               58.76      50.75   54.46
              Champollion             78.41      84.00   81.11
              BERT-Multi-Sim          99.47      94.38   96.86
50%           BLEUalign               89.53      80.40   84.72
              Champollion             55.21      81.60   65.86
              BERT-Multi-Sim          98.31      92.80   95.47
90%           BLEUalign               52.47      85.00   64.89
              Champollion             13.76      82.00   23.56
              BERT-Multi-Sim          95.00      95.00   95.00
Table 3  Comparable corpus filtering results