Data Analysis and Knowledge Discovery  2024, Vol. 8 Issue (6): 56-68    DOI: 10.11925/infotech.2096-3467.2023.0350
LingAlign: A Multilingual Sentence Aligner Using Cross-Lingual Sentence Embeddings
Liu Lei1, Liang Maocheng2
1School of Foreign Studies, Yanshan University, Qinhuangdao 066004, China
2School of Foreign Languages, Beihang University, Beijing 100191, China
Abstract  

[Objective] This paper develops a multilingual sentence aligner to support parallel-corpus-based research in digital humanities and machine translation. [Methods] The system first encodes the bitext to be aligned in a shared vector space, then measures the semantic similarity between sentences with a modified cosine similarity, and finally uses a two-stage dynamic programming algorithm to extract parallel sentence pairs automatically. [Results] The system is assessed with both intrinsic and extrinsic evaluation. In the intrinsic evaluation, average precision, recall and F1 reach 0.950, 0.960 and 0.955, respectively. In the extrinsic evaluation, the chrF, chrF++ and COMET scores are 55.65, 55.85 and 87.31, respectively. [Limitations] A data collection platform that integrates document alignment and sentence alignment has yet to be developed. [Conclusions] The proposed approach outperforms existing methods in both intrinsic and extrinsic evaluation tasks and may help promote the construction of large, high-quality multilingual parallel corpora.

Key words: Cross-Lingual Sentence Embeddings; Automatic Sentence Alignment; Neural Machine Translation
Received: 19 April 2023      Published: 15 March 2024
ZTFLH: H085; TP391
Fund: National Social Science Fund of China (19BYY082); Key Foundation of Social Science Development of Hebei Province (20230104006); MOE Foundation of Humanities and Social Sciences (17YJC740055)
Corresponding Author: Liang Maocheng, ORCID: 0000-0002-1678-7122, E-mail: frankliang0086@163.com

Cite this article:

Liu Lei, Liang Maocheng. LingAlign: A Multilingual Sentence Aligner Using Cross-Lingual Sentence Embeddings. Data Analysis and Knowledge Discovery, 2024, 8(6): 56-68.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2023.0350     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2024/V8/I6/56

| Alignment type | Chinese source | English translation |
|---|---|---|
| One-to-one | s1: “一场梦,醒了而已。” | t1: “I’m just waking up from a dream.” |
| One-to-many | s2: 叶文洁说完又笑了笑,抱着那摞复印纸和信封走出了办公室。 | t2: Ye smiled again. |
| | | t3: She took the stack of photocopies and the envelope and left the office. |
| Many-to-many | s3: 她回到住处,取了饭盒去食堂,才发现只剩下馒头和咸菜了。 | t4: She went back to her room, picked up her lunch box, and went to the cafeteria. |
| | s4: 食堂的人又没好气地告诉她要关门了,她只好端着饭盒走了出来,走到那道悬崖前,坐在草地上啃着凉馒头。 | t5: Only mantou buns and pickles were left, and the cafeteria workers told her impatiently that they were closing. |
| | | t6: So she had no choice but to carry her lunch box outside and walk next to the lip of the cliff, where she sat down on the grass to chew the cold mantou. |
Examples of Sentence Alignment Types
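| System | Dataset | Language pair | Precision (P) | Recall (R) | F1 |
|---|---|---|---|---|---|
| Gale-Church? | Berg | German-French | 0.67 | 0.68 | 0.68 |
| Hunalign* | Berg | German-French | 0.61 | 0.73 | 0.66 |
| Bleualign? | Berg | German-French | 0.83 | 0.78 | 0.81 |
| Vecalign* | Berg | German-French | 0.89 | 0.90 | 0.90 |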
Comparative Performance of Existing Sentence Aligners
Two-Step Dynamic Programming Algorithm
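The figure behind this caption did not survive extraction, and the abstract gives no details of the two stages. As a rough illustration of dynamic-programming sentence alignment in general, the sketch below runs a single monotonic pass over a source-by-target similarity matrix. It is not the authors' two-step procedure: the function name `align`, the `skip_score` penalty for unaligned sentences, and the use of the mean cell similarity as the score of a 1:2 or 2:1 bead are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

Bead = Tuple[Tuple[int, ...], Tuple[int, ...]]

# Allowed moves: (source sentences consumed, target sentences consumed).
MOVES = [(1, 1), (1, 0), (0, 1), (1, 2), (2, 1)]


def align(sim: List[List[float]], skip_score: float = -0.1) -> List[Bead]:
    """Return the highest-scoring monotonic alignment for a similarity matrix."""
    n = len(sim)
    m = len(sim[0]) if n else 0
    best = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    back: List[List[Optional[Tuple[int, int]]]] = [
        [None] * (m + 1) for _ in range(n + 1)
    ]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == float("-inf"):
                continue
            for di, dj in MOVES:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                if di == 0 or dj == 0:
                    gain = skip_score  # unaligned sentence (1:0 or 0:1)
                else:
                    # Assumed block score: mean similarity of the covered cells.
                    cells = [sim[a][b] for a in range(i, ni) for b in range(j, nj)]
                    gain = sum(cells) / len(cells)
                if best[i][j] + gain > best[ni][nj]:
                    best[ni][nj] = best[i][j] + gain
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the beads in order.
    beads: List[Bead] = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((tuple(range(pi, i)), tuple(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]


toy_sim = [
    [0.9, 0.1, 0.0, 0.1],
    [0.1, 0.8, 0.8, 0.0],
    [0.0, 0.1, 0.1, 0.9],
]
toy_beads = align(toy_sim)
```

In the toy matrix, the second source sentence is equally similar to the second and third target sentences, so the best path merges them into one 1:2 bead. A production aligner would normally score merged beads by embedding the concatenated sentences and would restrict the search to a band around the diagonal.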
Comparison of Sentence Alignment Based on Traditional and Modified Cosine Similarity
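This caption contrasts traditional and modified cosine similarity, but the modification itself is not defined on this page. One modification that is common in bitext mining is a ratio-margin score, which divides the cosine of a sentence pair by the average cosine of each sentence to its k nearest neighbours on the other side. The sketch below illustrates that variant purely as an assumption about what "modified" could mean; the names `cosine` and `margin_scores` and the default `k` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain cosine similarity; returns 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def margin_scores(
    src: List[Sequence[float]], tgt: List[Sequence[float]], k: int = 2
) -> List[List[float]]:
    """Ratio-margin scores: cosine divided by the mean neighbourhood cosine."""
    cos = [[cosine(s, t) for t in tgt] for s in src]
    # Average cosine of every source sentence to its k most similar targets.
    src_avg = [sum(sorted(row, reverse=True)[:k]) / min(k, len(row)) for row in cos]
    # Same for every target sentence, looking at the columns.
    cols = [list(col) for col in zip(*cos)]
    tgt_avg = [sum(sorted(col, reverse=True)[:k]) / min(k, len(col)) for col in cols]
    scores = []
    for i, row in enumerate(cos):
        out_row = []
        for j, c in enumerate(row):
            denom = (src_avg[i] + tgt_avg[j]) / 2
            out_row.append(c / denom if denom else 0.0)
        scores.append(out_row)
    return scores


# Toy embeddings: the second target vector is a "hub" close to both sources.
toy_src = [[1.0, 0.0], [0.0, 1.0]]
toy_tgt = [[1.0, 0.0], [1.0, 1.0]]
toy_scores = margin_scores(toy_src, toy_tgt, k=2)
```

In this toy case both source vectors have the same raw cosine to the hub target, yet the margin score ranks the second source higher, because the first source already has a much better match elsewhere. That is the kind of effect a neighbourhood-normalised similarity is meant to produce.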
COMET Model for MT Quality Evaluation
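| Evaluation data | Corpus source | Source language | Source sentences | Source word count | Target language | Target sentences | Target word count |
|---|---|---|---|---|---|---|---|
| Berg-Dev | Yearbooks of the Swiss Alpine Club (SAC) | German | 468 | 8 039 | French | 554 | 10 150 |
| Berg | Yearbooks of the Swiss Alpine Club (SAC) | German | 991 | 16 577 | French | 1 011 | 19 718 |
| Gov | Xi Jinping: The Governance of China, Volume I | Chinese | 1 797 | 37 391 | English | 2 252 | 53 772 |
| | | | | | German | 2 400 | 54 986 |
| | | | | | French | 2 217 | 64 544 |
| | | | | | Japanese | 2 037 | 73 035 |
| TB | The Three-Body Problem (《三体I:地球往事》) | Chinese | 1 752 | 33 577 | English | 2 890 | 40 058 |
| | | | | | German | 3 105 | 40 491 |
| | | | | | French | 2 764 | 48 092 |
| | | | | | Italian | 2 727 | 40 546 |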
An Overview of Intrinsic Evaluation Dataset
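| Alignment type | Statistic | Berg German-French | Gov Chinese-German | Gov Chinese-French | Gov Chinese-Japanese | Gov Chinese-English | TB Chinese-German | TB Chinese-French | TB Chinese-Italian | TB Chinese-English | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1:0 & 0:1 | Count | 58 | 5 | 5 | 1 | 9 | 24 | 19 | 19 | 22 | 162 |
| | Percentage | 6.3% | 0.3% | 0.3% | 0.1% | 0.5% | 1.5% | 1.2% | 1.2% | 1.3% | 1.1% |
| 1:1 | Count | 678 | 1 284 | 1 310 | 1 547 | 1 375 | 687 | 743 | 759 | 768 | 9 151 |
| | Percentage | 74.0% | 72.8% | 75.4% | 87.4% | 77.8% | 42.1% | 46.6% | 47.3% | 46.7% | 63.4% |
| 1:2 & 2:1 | Count | 145 | 342 | 334 | 180 | 289 | 508 | 483 | 522 | 517 | 3 320 |
| | Percentage | 15.8% | 19.4% | 19.2% | 10.2% | 16.3% | 31.1% | 30.3% | 32.5% | 31.4% | 23.0% |
| Other | Count | 35 | 133 | 88 | 43 | 95 | 414 | 349 | 305 | 339 | 1 801 |
| | Percentage | 3.8% | 7.5% | 5.1% | 2.4% | 5.4% | 25.4% | 21.9% | 19.0% | 20.6% | 12.5% |
| Total | Count | 916 | 1 764 | 1 737 | 1 771 | 1 768 | 1 633 | 1 594 | 1 605 | 1 646 | 14 434 |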
Distribution of Alignment Types in the Manually Aligned Corpus
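A distribution like this one can be tallied directly from a list of alignment beads. The sketch below is a minimal illustration, assuming each bead is a pair of index tuples (source sentence indices, target sentence indices); the helper names `bead_type` and `distribution` are hypothetical and the four categories simply mirror the rows of the table.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def bead_type(n_src: int, n_tgt: int) -> str:
    """Map a bead's source/target sentence counts to one of four categories."""
    pair = (n_src, n_tgt)
    if pair in {(1, 0), (0, 1)}:
        return "1:0 & 0:1"
    if pair == (1, 1):
        return "1:1"
    if pair in {(1, 2), (2, 1)}:
        return "1:2 & 2:1"
    return "other"


def distribution(
    beads: Iterable[Tuple[Sequence[int], Sequence[int]]]
) -> Dict[str, Tuple[int, float]]:
    """Return {category: (count, percentage rounded to one decimal)}."""
    counts = Counter(bead_type(len(src), len(tgt)) for src, tgt in beads)
    total = sum(counts.values())
    return {name: (c, round(100 * c / total, 1)) for name, c in counts.items()}


toy_beads = [
    ((0,), (0,)),       # 1:1
    ((1,), (1, 2)),     # 1:2
    ((2, 3), (3, 4)),   # 2:2, counted as "other"
    ((), (5,)),         # 0:1
]
toy_dist = distribution(toy_beads)
```

The same rounding reproduces the overall 1:1 share reported in the table: 9 151 of 14 434 beads is 63.4%.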
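| Evaluation data | Corpus source | Source language | Source sentences | Source word count | Target language | Target sentences | Target word count |
|---|---|---|---|---|---|---|---|
| Sub-Train | iQIYI international video website | Chinese | 183 078 | 746 126 | Vietnamese | 204 494 | 962 833 |
| Sub-Dev | iQIYI international video website | Chinese | 2 529 | 11 176 | Vietnamese | 2 529 | 13 395 |
| Sub-Test | iQIYI international video website | Chinese | 2 856 | 11 740 | Vietnamese | 2 856 | 13 878 |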
An Overview of Extrinsic Evaluation Dataset
Experimental Design
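| System | Metric | Berg German-French | Gov Chinese-German | Gov Chinese-French | Gov Chinese-Japanese | Gov Chinese-English | Gov average | TB Chinese-German | TB Chinese-French | TB Chinese-Italian | TB Chinese-English | TB average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gale-Church | P | 0.672 | 0.682 | 0.764 | 0.891 | 0.736 | 0.768 | 0.177 | 0.317 | 0.327 | 0.311 | 0.283 |
| | R | 0.685 | 0.694 | 0.770 | 0.892 | 0.741 | 0.774 | 0.204 | 0.352 | 0.363 | 0.343 | 0.316 |
| | F1 | 0.678 | 0.688 | 0.767 | 0.891 | 0.739 | 0.771 | 0.190 | 0.334 | 0.344 | 0.326 | 0.299 |
| Bleualign | P | 0.851 | 0.791 | 0.848 | 0.914 | 0.848 | 0.852 | 0.502 | 0.596 | 0.618 | 0.688 | 0.601 |
| | R | 0.816 | 0.715 | 0.801 | 0.887 | 0.803 | 0.802 | 0.380 | 0.530 | 0.563 | 0.638 | 0.528 |
| | F1 | 0.833 | 0.751 | 0.824 | 0.900 | 0.825 | 0.826 | 0.433 | 0.561 | 0.589 | 0.662 | 0.561 |
| Hunalign | P | 0.798 | 0.789 | 0.843 | 0.900 | 0.844 | 0.844 | 0.464 | 0.538 | 0.563 | 0.620 | 0.546 |
| | R | 0.847 | 0.849 | 0.886 | 0.939 | 0.881 | 0.889 | 0.583 | 0.646 | 0.666 | 0.710 | 0.651 |
| | F1 | 0.822 | 0.818 | 0.864 | 0.919 | 0.862 | 0.866 | 0.517 | 0.587 | 0.610 | 0.662 | 0.594 |
| Vecalign | P | 0.899 | 0.957 | 0.962 | 0.977 | 0.957 | 0.963 | 0.855 | 0.867 | 0.872 | 0.919 | 0.878 |
| | R | 0.904 | 0.961 | 0.966 | 0.972 | 0.957 | 0.964 | 0.886 | 0.898 | 0.902 | 0.939 | 0.906 |
| | F1 | 0.902 | 0.959 | 0.964 | 0.975 | 0.957 | 0.964 | 0.870 | 0.882 | 0.887 | 0.929 | 0.892 |
| LingAlign | P | 0.941 | 0.975 | 0.982 | 0.992 | 0.974 | 0.981 | 0.910 | 0.913 | 0.915 | 0.933 | 0.918 |
| | R | 0.943 | 0.979 | 0.982 | 0.994 | 0.974 | 0.982 | 0.926 | 0.937 | 0.939 | 0.951 | 0.938 |
| | F1 | 0.942 | 0.977 | 0.982 | 0.993 | 0.974 | 0.982 | 0.918 | 0.925 | 0.927 | 0.942 | 0.928 |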
Intrinsic Evaluation Results
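The P, R and F1 figures above can be related to one another with a short calculation. The sketch below assumes strict, exact-match scoring, where a predicted bead counts as correct only if the identical bead appears in the gold alignment; this page does not say which scoring variant the authors used, and the function names `prf` and `f1_from` are hypothetical.

```python
from typing import Iterable, Tuple

Bead = Tuple[Tuple[int, ...], Tuple[int, ...]]


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def prf(predicted: Iterable[Bead], gold: Iterable[Bead]) -> Tuple[float, float, float]:
    """Strict precision, recall and F1 over exactly matching beads."""
    pred, ref = set(predicted), set(gold)
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall, f1_from(precision, recall)


gold_beads = [((0,), (0,)), ((1,), (1, 2)), ((2,), (3,))]
pred_beads = [((0,), (0,)), ((1,), (1,)), ((2,), (3,)), ((), (2,))]
p, r, f1 = prf(pred_beads, gold_beads)
```

Here the aligner split the gold 1:2 bead into a 1:1 bead and an unaligned target sentence, so two of its four beads are correct (P = 0.5) and two of the three gold beads are recovered (R = 2/3). As a consistency check on the table, `f1_from(0.941, 0.943)` rounds to 0.942, the F1 reported for LingAlign on Berg.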
| Error type | Source | Translation |
|---|---|---|
| Addition or omission in translation | s1: “在运河两岸立两根柱子,柱子之间平行地扯上许多细丝,间距半米左右,这些细丝是汪教授他们制造出来的那种叫‘飞刃’的纳米材料。” | t1: “We set up two pillars on the shores of the canal, and then between them we string many parallel, thin filaments, about half a meter apart. |
| | | t2: The filaments should be made from the nanomaterial called ‘Flying Blade,’ developed by Professor Wang. |
| | | t3: A very appropriate name, in this case.” |
| Crossing alignment | s1: “哈哈哈,又放倒了一个!” | t1: Wang’s cries were interrupted by laughter. |
| | s2: 汪淼的哭泣被身后的一阵笑声打断,他扭头一看,大史站在那里,嘴里吐出一口白烟。 | t2: Hahaha, another one bites the dust! |
| | | t3: He turned around. |
| | | t4: Captain Shi Qiang stood there, blowing out a mouthful of white smoke. |
Error Analysis of Automatic Alignments
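| System | Sentence-pair types in training data | Number of sentence pairs | chrF | chrF++ | COMET |
|---|---|---|---|---|---|
| Gale-Church | 1:1 | 156 598 | 53.61 | 53.88 | 86.45 |
| | 1:1 + 1:2 + 2:1 | 181 377 | 54.84 | 55.03 | 86.67 |
| | 1:1 + 1:2 + 2:1 + other types | 181 386 | 54.48 | 54.68 | 86.68 |
| Bleualign | 1:1 | 148 635 | 54.76 | 54.97 | 86.69 |
| | 1:1 + 1:2 + 2:1 | 162 337 | 54.84 | 55.02 | 86.96 |
| | 1:1 + 1:2 + 2:1 + other types | 163 236 | 54.75 | 54.97 | 86.94 |
| Hunalign | 1:1 | 166 843 | 54.97 | 55.19 | 87.18 |
| | 1:1 + 1:2 + 2:1 | 178 387 | 55.10 | 55.33 | 87.03 |
| | 1:1 + 1:2 + 2:1 + other types | 178 803 | 55.03 | 55.25 | 87.09 |
| Vecalign | 1:1 | 154 337 | 54.73 | 54.96 | 87.09 |
| | 1:1 + 1:2 + 2:1 | 172 414 | 55.05 | 55.27 | 87.16 |
| | 1:1 + 1:2 + 2:1 + other types | 176 998 | 55.35 | 55.53 | 87.17 |
| LingAlign | 1:1 | 155 558 | 55.12 | 55.34 | 87.16 |
| | 1:1 + 1:2 + 2:1 | 170 067 | 55.42 | 55.65 | 87.19 |
| | 1:1 + 1:2 + 2:1 + other types | 176 857 | 55.65 | 55.85 | 87.31 |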
Extrinsic Evaluation Results
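For readers unfamiliar with chrF, the sketch below shows the idea behind the metric in a simplified, sentence-level form: character n-gram precision and recall are averaged over n = 1 to 6 and combined into an F-score that weights recall more heavily (β = 2). It is an illustration only and will not reproduce the scores in the table, which would come from a reference implementation applied to a whole test set; the function names are hypothetical. chrF++ additionally includes word n-grams, which this sketch omits.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to have n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


partial = simple_chrf("the cat sat", "the cat sat down")
```

An identical hypothesis and reference score 100, strings with no characters in common score 0, and a partial match such as the one above falls in between.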
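| System | Language pair | Alignment time (s) | Parallel sentence pairs |
|---|---|---|---|
| Gale-Church | Chinese-Vietnamese | 25.4 | 181 386 |
| Bleualign | Chinese-Vietnamese | 274.6 | 163 236 |
| Hunalign | Chinese-Vietnamese | 26.3 | 178 803 |
| Vecalign | Chinese-Vietnamese | 142.6 | 176 998 |
| LingAlign | Chinese-Vietnamese | 38.7 | 176 857 |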
Running Speeds of Different Sentence Aligners