Data Analysis and Knowledge Discovery, 2024, Vol. 8, Issue 6: 56-68     https://doi.org/10.11925/infotech.2096-3467.2023.0350
Research Paper
LingAlign: A Multilingual Sentence Aligner Using Cross-Lingual Sentence Embeddings
Liu Lei¹, Liang Maocheng²
¹ School of Foreign Studies, Yanshan University, Qinhuangdao 066004, China
² School of Foreign Languages, Beihang University, Beijing 100191, China
Abstract

[Objective] This paper develops a multilingual sentence aligner to support parallel-corpus-based research in digital humanities and machine translation. [Methods] The system first encodes the bitext to be aligned into a shared vector space using cross-lingual sentence embeddings, then scores the semantic relationship between sentences with a modified cosine similarity. Finally, a two-stage dynamic programming algorithm extracts parallel sentence pairs automatically. [Results] System performance is assessed through both intrinsic and extrinsic evaluation. In the intrinsic evaluation, the average precision, recall and F1 reach 0.950, 0.960 and 0.955, respectively. In the extrinsic evaluation, the chrF, chrF++ and COMET scores are 55.65, 55.85 and 87.31, respectively. [Limitations] A corpus collection platform that integrates document alignment and sentence alignment has yet to be developed. [Conclusions] The proposed approach outperforms existing methods in both intrinsic and extrinsic evaluation tasks, which may help in building large-scale, high-quality multilingual parallel corpora.

Keywords: Cross-Lingual Sentence Embeddings; Automatic Sentence Alignment; Neural Machine Translation
Received: 2023-04-19      Published: 2024-03-15
CLC number (ZTFLH): H085; TP391
Funding: National Social Science Fund of China (19BYY082); Key Project of Social Science Development Research of Hebei Province (20230104006); Humanities and Social Sciences Research Project of the Ministry of Education (17YJC740055)
Corresponding author: Liang Maocheng, ORCID: 0000-0002-1678-7122, E-mail: frankliang0086@163.com
Cite this article:
Liu Lei, Liang Maocheng. LingAlign: A Multilingual Sentence Aligner Using Cross-Lingual Sentence Embeddings[J]. Data Analysis and Knowledge Discovery, 2024, 8(6): 56-68.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2023.0350      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2024/V8/I6/56
| Alignment type | Chinese source | English translation |
|---|---|---|
| One-to-one | s1:“一场梦,醒了而已。” | t1: “I’m just waking up from a dream.” |
| One-to-many | s2:叶文洁说完又笑了笑,抱着那摞复印纸和信封走出了办公室。 | t2: Ye smiled again. |
| | | t3: She took the stack of photocopies and the envelope and left the office. |
| Many-to-many | s3:她回到住处,取了饭盒去食堂,才发现只剩下馒头和咸菜了。 | t4: She went back to her room, picked up her lunch box, and went to the cafeteria. |
| | s4:食堂的人又没好气地告诉她要关门了,她只好端着饭盒走了出来,走到那道悬崖前,坐在草地上啃着凉馒头。 | t5: Only mantou buns and pickles were left, and the cafeteria workers told her impatiently that they were closing. |
| | | t6: So she had no choice but to carry her lunch box outside and walk next to the lip of the cliff, where she sat down on the grass to chew the cold mantou. |

Table 1  Examples of sentence alignment types
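The alignment types in Table 1 can be represented as pairs of sentence-index groups. The following is a minimal sketch; the `Bead` class, its field names, and the type labels are illustrative assumptions, not structures taken from the paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Bead:
    """One alignment unit: indices of the source and target sentences it links."""
    src: Tuple[int, ...]
    tgt: Tuple[int, ...]

    def kind(self) -> str:
        """Classify the bead by how many sentences sit on each side."""
        m, n = len(self.src), len(self.tgt)
        if m == 0 or n == 0:
            return "unaligned"      # 1:0 or 0:1, a sentence with no counterpart
        if m == 1 and n == 1:
            return "one-to-one"
        if m == 1:
            return "one-to-many"
        if n == 1:
            return "many-to-one"
        return "many-to-many"


# The three examples of Table 1, using the sentence numbers s1..s4 and t1..t6.
beads = [
    Bead(src=(1,), tgt=(1,)),
    Bead(src=(2,), tgt=(2, 3)),
    Bead(src=(3, 4), tgt=(4, 5, 6)),
]
kinds = [b.kind() for b in beads]
```

Storing indices rather than text keeps a bead hashable, so predicted and reference alignments can be compared as sets.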
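| System | Dataset | Language pair | Precision (P) | Recall (R) | F1 |
|---|---|---|---|---|---|
| Gale-Church† | Berg | German-French | 0.67 | 0.68 | 0.68 |
| Hunalign* | Berg | German-French | 0.61 | 0.73 | 0.66 |
| Bleualign† | Berg | German-French | 0.83 | 0.78 | 0.81 |
| Vecalign* | Berg | German-French | 0.89 | 0.90 | 0.90 |

Table 2  Performance of existing sentence alignment systems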
Fig.1  Two-stage dynamic programming algorithm
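The figure itself is not reproduced here, so the two stages cannot be restated from this page. As a rough illustration of the general idea, the sketch below runs a single-pass dynamic program over sentence embeddings and allows 1:1, 1:0, 0:1, 1:2 and 2:1 beads. It is not the paper's algorithm: the function names, the `skip_score` parameter, and the shortcut of averaging vectors to represent a merged span are all assumptions made for this example.

```python
import numpy as np

# Bead shapes considered: (source sentences consumed, target sentences consumed).
BEADS = [(1, 1), (1, 0), (0, 1), (1, 2), (2, 1)]


def span_vec(emb, start, length):
    """Unit vector for a span; averaging member vectors is an assumption."""
    v = emb[start:start + length].mean(axis=0)
    return v / np.linalg.norm(v)


def align(src_emb, tgt_emb, skip_score=0.0):
    """Return the bead sequence that maximizes the summed cosine similarity."""
    m, n = len(src_emb), len(tgt_emb)
    best = np.full((m + 1, n + 1), -np.inf)
    best[0, 0] = 0.0
    back = {}
    for i in range(m + 1):
        for j in range(n + 1):
            if best[i, j] == -np.inf:
                continue
            for a, b in BEADS:
                if i + a > m or j + b > n:
                    continue
                if a == 0 or b == 0:
                    gain = skip_score  # unaligned sentence
                else:
                    gain = float(span_vec(src_emb, i, a) @ span_vec(tgt_emb, j, b))
                if best[i, j] + gain > best[i + a, j + b]:
                    best[i + a, j + b] = best[i, j] + gain
                    back[(i + a, j + b)] = (i, j)
    # Walk the back-pointers from the end to recover the beads in order.
    path, i, j = [], m, n
    while (i, j) != (0, 0):
        pi, pj = back[(i, j)]
        path.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return path[::-1]


# Toy input: the second source sentence covers the second and third targets.
src = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
tgt = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
result = align(src, tgt)
```

This version fills the full table, so its cost grows with the product of the two document lengths; restricting the search to a band around likely 1:1 anchors is one common way to reduce that cost.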
Fig.2  Comparison of sentence alignment based on conventional and modified cosine similarity
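This page does not give the formula for the modified cosine similarity. One widely used modification in bitext mining is the margin-based ratio score, which divides each raw cosine by the average similarity of both sentences to their nearest neighbours. The sketch below shows that score purely as an illustration; whether it matches the paper's formula is an assumption, and the function names and the parameter `k` are hypothetical.

```python
import numpy as np


def cosine_matrix(x, y):
    """Pairwise cosine similarity between the rows of x and the rows of y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return x @ y.T


def margin_ratio(sim, k=4):
    """Rescale each cosine by the mean similarity to the k nearest neighbours."""
    k_row = min(k, sim.shape[1])
    k_col = min(k, sim.shape[0])
    # Mean similarity of every source sentence to its k closest targets.
    row_avg = np.sort(sim, axis=1)[:, -k_row:].mean(axis=1)
    # Mean similarity of every target sentence to its k closest sources.
    col_avg = np.sort(sim, axis=0)[-k_col:, :].mean(axis=0)
    denom = (row_avg[:, None] + col_avg[None, :]) / 2
    return sim / denom  # assumes the neighbourhood averages are positive


# Toy input: target 1 is equally close to both sources under raw cosine.
x = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([[1.0, 0.0], [1.0, 1.0]])
sim = cosine_matrix(x, y)
scores = margin_ratio(sim, k=2)
```

In the toy input the raw cosine cannot separate the two candidates for target 1, while the rescaled score favours source 1, whose other similarities are low.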
Fig.3  The COMET translation quality evaluation model
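| Evaluation data | Corpus source | Source language | Source sentences | Source words | Target language | Target sentences | Target words |
|---|---|---|---|---|---|---|---|
| Berg-Dev | Yearbooks of the Swiss Alpine Club (SAC) | German | 468 | 8,039 | French | 554 | 10,150 |
| Berg | Yearbooks of the Swiss Alpine Club (SAC) | German | 991 | 16,577 | French | 1,011 | 19,718 |
| Gov | Xi Jinping: The Governance of China, Volume I | Chinese | 1,797 | 37,391 | English | 2,252 | 53,772 |
| | | | | | German | 2,400 | 54,986 |
| | | | | | French | 2,217 | 64,544 |
| | | | | | Japanese | 2,037 | 73,035 |
| TB | The Three-Body Problem (Remembrance of Earth's Past, Book I) | Chinese | 1,752 | 33,577 | English | 2,890 | 40,058 |
| | | | | | German | 3,105 | 40,491 |
| | | | | | French | 2,764 | 48,092 |
| | | | | | Italian | 2,727 | 40,546 |

Table 3  Overview of the intrinsic evaluation data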
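| Alignment type | Statistic | Berg: German-French | Gov: Chinese-German | Gov: Chinese-French | Gov: Chinese-Japanese | Gov: Chinese-English | TB: Chinese-German | TB: Chinese-French | TB: Chinese-Italian | TB: Chinese-English | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1:0 & 0:1 | Frequency | 58 | 5 | 5 | 1 | 9 | 24 | 19 | 19 | 22 | 162 |
| | Percentage | 6.3% | 0.3% | 0.3% | 0.1% | 0.5% | 1.5% | 1.2% | 1.2% | 1.3% | 1.1% |
| 1:1 | Frequency | 678 | 1,284 | 1,310 | 1,547 | 1,375 | 687 | 743 | 759 | 768 | 9,151 |
| | Percentage | 74.0% | 72.8% | 75.4% | 87.4% | 77.8% | 42.1% | 46.6% | 47.3% | 46.7% | 63.4% |
| 1:2 & 2:1 | Frequency | 145 | 342 | 334 | 180 | 289 | 508 | 483 | 522 | 517 | 3,320 |
| | Percentage | 15.8% | 19.4% | 19.2% | 10.2% | 16.3% | 31.1% | 30.3% | 32.5% | 31.4% | 23.0% |
| Other | Frequency | 35 | 133 | 88 | 43 | 95 | 414 | 349 | 305 | 339 | 1,801 |
| | Percentage | 3.8% | 7.5% | 5.1% | 2.4% | 5.4% | 25.4% | 21.9% | 19.0% | 20.6% | 12.5% |
| Total | Frequency | 916 | 1,764 | 1,737 | 1,771 | 1,768 | 1,633 | 1,594 | 1,605 | 1,646 | 14,434 |

Table 4  Distribution of alignment types in the manually aligned data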
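The percentage rows of Table 4 follow directly from the frequency rows. The short check below recomputes them for two columns; the dictionary layout and the `shares` helper are assumptions made for this example, while the counts are copied from the table.

```python
# Frequencies copied from two columns of Table 4.
counts = {
    "Berg German-French": {"1:0 & 0:1": 58, "1:1": 678, "1:2 & 2:1": 145, "other": 35},
    "TB Chinese-German": {"1:0 & 0:1": 24, "1:1": 687, "1:2 & 2:1": 508, "other": 414},
}


def shares(column):
    """Percentage of each alignment type within one language pair."""
    total = sum(column.values())
    return {kind: round(100 * freq / total, 1) for kind, freq in column.items()}


berg = shares(counts["Berg German-French"])
tb = shares(counts["TB Chinese-German"])
```

The contrast between the two columns is the point of interest: 1:1 pairs make up about three quarters of the Berg data but well under half of the literary TB data.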
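| Evaluation data | Corpus source | Source language | Source sentences | Source words | Target language | Target sentences | Target words |
|---|---|---|---|---|---|---|---|
| Sub-Train | iQIYI international website | Chinese | 183,078 | 746,126 | Vietnamese | 204,494 | 962,833 |
| Sub-Dev | iQIYI international website | Chinese | 2,529 | 11,176 | Vietnamese | 2,529 | 13,395 |
| Sub-Test | iQIYI international website | Chinese | 2,856 | 11,740 | Vietnamese | 2,856 | 13,878 |

Table 5  Overview of the extrinsic evaluation data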
Fig.4  Experimental design
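| System | Metric | Berg: German-French | Gov: Chinese-German | Gov: Chinese-French | Gov: Chinese-Japanese | Gov: Chinese-English | Gov: average | TB: Chinese-German | TB: Chinese-French | TB: Chinese-Italian | TB: Chinese-English | TB: average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gale-Church | P | 0.672 | 0.682 | 0.764 | 0.891 | 0.736 | 0.768 | 0.177 | 0.317 | 0.327 | 0.311 | 0.283 |
| | R | 0.685 | 0.694 | 0.770 | 0.892 | 0.741 | 0.774 | 0.204 | 0.352 | 0.363 | 0.343 | 0.316 |
| | F1 | 0.678 | 0.688 | 0.767 | 0.891 | 0.739 | 0.771 | 0.190 | 0.334 | 0.344 | 0.326 | 0.299 |
| Bleualign | P | 0.851 | 0.791 | 0.848 | 0.914 | 0.848 | 0.852 | 0.502 | 0.596 | 0.618 | 0.688 | 0.601 |
| | R | 0.816 | 0.715 | 0.801 | 0.887 | 0.803 | 0.802 | 0.380 | 0.530 | 0.563 | 0.638 | 0.528 |
| | F1 | 0.833 | 0.751 | 0.824 | 0.900 | 0.825 | 0.826 | 0.433 | 0.561 | 0.589 | 0.662 | 0.561 |
| Hunalign | P | 0.798 | 0.789 | 0.843 | 0.900 | 0.844 | 0.844 | 0.464 | 0.538 | 0.563 | 0.620 | 0.546 |
| | R | 0.847 | 0.849 | 0.886 | 0.939 | 0.881 | 0.889 | 0.583 | 0.646 | 0.666 | 0.710 | 0.651 |
| | F1 | 0.822 | 0.818 | 0.864 | 0.919 | 0.862 | 0.866 | 0.517 | 0.587 | 0.610 | 0.662 | 0.594 |
| Vecalign | P | 0.899 | 0.957 | 0.962 | 0.977 | 0.957 | 0.963 | 0.855 | 0.867 | 0.872 | 0.919 | 0.878 |
| | R | 0.904 | 0.961 | 0.966 | 0.972 | 0.957 | 0.964 | 0.886 | 0.898 | 0.902 | 0.939 | 0.906 |
| | F1 | 0.902 | 0.959 | 0.964 | 0.975 | 0.957 | 0.964 | 0.870 | 0.882 | 0.887 | 0.929 | 0.892 |
| LingAlign | P | 0.941 | 0.975 | 0.982 | 0.992 | 0.974 | 0.981 | 0.910 | 0.913 | 0.915 | 0.933 | 0.918 |
| | R | 0.943 | 0.979 | 0.982 | 0.994 | 0.974 | 0.982 | 0.926 | 0.937 | 0.939 | 0.951 | 0.938 |
| | F1 | 0.942 | 0.977 | 0.982 | 0.993 | 0.974 | 0.982 | 0.918 | 0.925 | 0.927 | 0.942 | 0.928 |

Table 6  Intrinsic evaluation results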
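Precision, recall and F1 for alignment output are usually computed by comparing predicted beads against manually aligned ones. The sketch below uses strict matching, where a predicted bead counts only if it is identical to a reference bead. Whether the paper scores strictly or leniently is not stated on this page, so the matching rule, the function names, and the toy data are assumptions.

```python
def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def prf(pred, gold):
    """Strict precision, recall and F1 over sets of alignment beads."""
    pred, gold = set(pred), set(gold)
    hits = len(pred & gold)
    p = hits / len(pred) if pred else 0.0
    r = hits / len(gold) if gold else 0.0
    return p, r, f1_score(p, r)


# Beads are (source indices, target indices); the second prediction is wrong.
gold = [((0,), (0,)), ((1,), (1, 2)), ((2, 3), (3,))]
pred = [((0,), (0,)), ((1,), (1,)), ((2, 3), (3,))]
p, r, f1 = prf(pred, gold)

# Consistency check against Table 6: LingAlign, TB average (P=0.918, R=0.938).
table_f1 = round(f1_score(0.918, 0.938), 3)
```

The last line reproduces the F1 reported in the TB average column for LingAlign from its precision and recall.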
| Error type | Source | Translation |
|---|---|---|
| Additions and omissions in translation | s1:“在运河两岸立两根柱子,柱子之间平行地扯上许多细丝,间距半米左右,这些细丝是汪教授他们制造出来的那种叫‘飞刃’的纳米材料。” | t1: “We set up two pillars on the shores of the canal, and then between them we string many parallel, thin filaments, about half a meter apart. |
| | | t2: The filaments should be made from the nanomaterial called ‘Flying Blade,’ developed by Professor Wang. |
| | | t3: A very appropriate name, in this case.” |
| Crossing alignment | s1:“哈哈哈,又放倒了一个!” | t1: Wang’s cries were interrupted by laughter. |
| | s2:汪淼的哭泣被身后的一阵笑声打断,他扭头一看,大史站在那里,嘴里吐出一口白烟。 | t2: Hahaha, another one bites the dust! |
| | | t3: He turned around. |
| | | t4: Captain Shi Qiang stood there, blowing out a mouthful of white smoke. |

Table 7  Error analysis of alignment results
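| System | Training data: pair types | Training data: sentence pairs | chrF | chrF++ | COMET |
|---|---|---|---|---|---|
| Gale-Church | 1:1 | 156,598 | 53.61 | 53.88 | 86.45 |
| | 1:1 + 1:2 + 2:1 | 181,377 | 54.84 | 55.03 | 86.67 |
| | 1:1 + 1:2 + 2:1 + other types | 181,386 | 54.48 | 54.68 | 86.68 |
| Bleualign | 1:1 | 148,635 | 54.76 | 54.97 | 86.69 |
| | 1:1 + 1:2 + 2:1 | 162,337 | 54.84 | 55.02 | 86.96 |
| | 1:1 + 1:2 + 2:1 + other types | 163,236 | 54.75 | 54.97 | 86.94 |
| Hunalign | 1:1 | 166,843 | 54.97 | 55.19 | 87.18 |
| | 1:1 + 1:2 + 2:1 | 178,387 | 55.10 | 55.33 | 87.03 |
| | 1:1 + 1:2 + 2:1 + other types | 178,803 | 55.03 | 55.25 | 87.09 |
| Vecalign | 1:1 | 154,337 | 54.73 | 54.96 | 87.09 |
| | 1:1 + 1:2 + 2:1 | 172,414 | 55.05 | 55.27 | 87.16 |
| | 1:1 + 1:2 + 2:1 + other types | 176,998 | 55.35 | 55.53 | 87.17 |
| LingAlign | 1:1 | 155,558 | 55.12 | 55.34 | 87.16 |
| | 1:1 + 1:2 + 2:1 | 170,067 | 55.42 | 55.65 | 87.19 |
| | 1:1 + 1:2 + 2:1 + other types | 176,857 | 55.65 | 55.85 | 87.31 |

Table 8  Extrinsic evaluation results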
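chrF compares a system translation with a reference through overlapping character n-grams and combines precision and recall into an F-score that weights recall more heavily. The sketch below is a simplified sentence-level version for illustration only. It is not the scorer used for Table 8, whose scores are presumably corpus-level and produced by a standard tool; the whitespace handling and the function names are assumptions.

```python
from collections import Counter


def char_ngrams(text, n):
    """Multiset of character n-grams, ignoring spaces (a simplification)."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def chrf(hyp, ref, max_n=6, beta=2.0):
    """Simplified chrF on a 0-100 scale: n-gram precision and recall averaged over orders."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue  # one side is too short to have n-grams of this order
        overlap = sum((h & r).values())
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


same = chrf("the cold mantou", "the cold mantou")
none = chrf("abc", "xyz")
partial = chrf("she left the office", "she left the room")
```

chrF++ additionally counts word unigrams and bigrams, and COMET is a learned neural metric, so neither can be reproduced with a few lines of string matching.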
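| System | Language pair | Alignment time (s) | Parallel sentence pairs |
|---|---|---|---|
| Gale-Church | Chinese-Vietnamese | 25.4 | 181,386 |
| Bleualign | Chinese-Vietnamese | 274.6 | 163,236 |
| Hunalign | Chinese-Vietnamese | 26.3 | 178,803 |
| Vecalign | Chinese-Vietnamese | 142.6 | 176,998 |
| LingAlign | Chinese-Vietnamese | 38.7 | 176,857 |

Table 9  Speed comparison of sentence alignment systems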