Data Analysis and Knowledge Discovery  2024, Vol. 8 Issue (6): 56-68    DOI: 10.11925/infotech.2096-3467.2023.0350
LingAlign: A Multilingual Sentence Aligner Using Cross-Lingual Sentence Embeddings
Liu Lei1, Liang Maocheng2
1School of Foreign Studies, Yanshan University, Qinhuangdao 066004, China
2School of Foreign Languages, Beihang University, Beijing 100191, China
[Objective] This paper develops a multilingual sentence aligner to support parallel-corpus-based research in digital humanities and machine translation. [Methods] The system first encodes the bitext to be aligned in a shared vector space and then measures the semantic relatedness of sentences with a modified cosine similarity. Finally, a two-stage dynamic programming algorithm automatically extracts parallel sentence pairs. [Results] We assess the system with both intrinsic and extrinsic evaluation. In the intrinsic evaluation, the average precision, recall and F1 reach 0.950, 0.960 and 0.955, respectively. In the extrinsic evaluation, the chrF, chrF++ and COMET scores are 55.65, 55.85 and 87.31, respectively. [Limitations] A data capture platform that integrates document alignment and sentence alignment has yet to be developed. [Conclusions] The proposed approach outperforms existing methods in both intrinsic and extrinsic evaluation tasks and may help promote the construction of large, high-quality multilingual parallel corpora.
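The three steps named in the abstract (embed, score, align) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the modified cosine similarity or the two stages of the dynamic programming, so the sketch substitutes plain cosine similarity and a single-pass search. The toy vectors, the `merge` averaging helper, the `MOVES` list and the `SKIP_COST` value are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Plain (unmodified) cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def merge(vectors):
    """Assumed stand-in for embedding a multi-sentence span: average the vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Allowed bead shapes as (source sentences, target sentences); an assumption.
MOVES = [(1, 1), (1, 0), (0, 1), (1, 2), (2, 1)]
SKIP_COST = 0.5  # hypothetical cost of leaving one sentence unaligned

def align(src, tgt):
    """Return the monotonic alignment with the lowest total cost (1 - cosine)."""
    n, m = len(src), len(tgt)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == math.inf:
                continue
            for di, dj in MOVES:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                if di == 0 or dj == 0:
                    step = SKIP_COST
                else:
                    step = 1.0 - cosine(merge(src[i:ni]), merge(tgt[j:nj]))
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the beads in order.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((tuple(range(pi, i)), tuple(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]

# Toy "embeddings": the second source sentence covers two target sentences.
src = [[1, 0, 0], [0, 1, 1]]
tgt = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(align(src, tgt))
```

On this toy input the search pairs the first sentences one-to-one and groups the remaining two target sentences with the second source sentence. A real system would embed the concatenated text of a span rather than average sentence vectors, and would restrict the search space to stay tractable on long documents.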

Key words: Cross-Lingual Sentence Embeddings; Automatic Sentence Alignment; Neural Machine Translation
Received: 19 April 2023      Published: 15 March 2024
CLC number (ZTFLH): H085
Fund: National Social Science Fund of China (19BYY082); Key Foundation of Social Science Development of Hebei Province (20230104006); MOE Foundation of Humanities and Social Sciences (17YJC740055)
Corresponding Author: Liang Maocheng, ORCID: 0000-0002-1678-7122

Cite this article:

Liu Lei, Liang Maocheng. LingAlign: A Multilingual Sentence Aligner Using Cross-Lingual Sentence Embeddings. Data Analysis and Knowledge Discovery, 2024, 8(6): 56-68.

| Alignment type | Chinese source | English translation |
| --- | --- | --- |
| One-to-one | s1: “一场梦,醒了而已。” | t1: “I’m just waking up from a dream.” |
| One-to-many | s2: 叶文洁说完又笑了笑,抱着那摞复印纸和信封走出了办公室。 | t2: Ye smiled again. |
| | | t3: She took the stack of photocopies and the envelope and left the office. |
| Many-to-many | s3: 她回到住处,取了饭盒去食堂,才发现只剩下馒头和咸菜了。 | t4: She went back to her room, picked up her lunch box, and went to the cafeteria. |
| | s4: 食堂的人又没好气地告诉她要关门了,她只好端着饭盒走了出来,走到那道悬崖前,坐在草地上啃着凉馒头。 | t5: Only mantou buns and pickles were left, and the cafeteria workers told her impatiently that they were closing. |
| | | t6: So she had no choice but to carry her lunch box outside and walk next to the lip of the cliff, where she sat down on the grass to chew the cold mantou. |
Examples of Sentence Alignment Types
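| System | Dataset | Language pair | Precision (P) | Recall (R) | F1 |
| --- | --- | --- | --- | --- | --- |
| Gale-Church? | Berg | German-French | 0.67 | 0.68 | 0.68 |
| Hunalign* | Berg | German-French | 0.61 | 0.73 | 0.66 |
| Bleualign? | Berg | German-French | 0.83 | 0.78 | 0.81 |
| Vecalign* | Berg | German-French | 0.89 | 0.90 | 0.90 |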
Comparative Performance of Existing Sentence Aligners
Two-Step Dynamic Programming Algorithm
Comparison of Sentence Alignment Based on Traditional and Modified Cosine Similarity
COMET Model for MT Quality Evaluation
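| Dataset | Corpus source | Source language | Source sentences | Source words/characters | Target language | Target sentences | Target words/characters |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Berg-Dev | Swiss SAC | German | 468 | 8 039 | French | 554 | 10 150 |
| Berg | Swiss SAC | German | 991 | 16 577 | French | 1 011 | 19 718 |
| Gov | 《习近平谈 | Chinese | 1 797 | 37 391 | English | 2 252 | 53 772 |
| | | | | | German | 2 400 | 54 986 |
| | | | | | French | 2 217 | 64 544 |
| | | | | | Japanese | 2 037 | 73 035 |
| TB | 《三体I: | Chinese | 1 752 | 33 577 | English | 2 890 | 40 058 |
| | | | | | German | 3 105 | 40 491 |
| | | | | | French | 2 764 | 48 092 |
| | | | | | Italian | 2 727 | 40 546 |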
An Overview of Intrinsic Evaluation Dataset
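| Alignment type | Statistic | Berg German-French | Gov Chinese-German | Gov Chinese-French | Gov Chinese-Japanese | Gov Chinese-English | TB Chinese-German | TB Chinese-French | TB Chinese-Italian | TB Chinese-English | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1:0 & 0:1 | Count | 58 | 5 | 5 | 1 | 9 | 24 | 19 | 19 | 22 | 162 |
| | Percentage | 6.3% | 0.3% | 0.3% | 0.1% | 0.5% | 1.5% | 1.2% | 1.2% | 1.3% | 1.1% |
| 1:1 | Count | 678 | 1 284 | 1 310 | 1 547 | 1 375 | 687 | 743 | 759 | 768 | 9 151 |
| | Percentage | 74.0% | 72.8% | 75.4% | 87.4% | 77.8% | 42.1% | 46.6% | 47.3% | 46.7% | 63.4% |
| 1:2 & 2:1 | Count | 145 | 342 | 334 | 180 | 289 | 508 | 483 | 522 | 517 | 3 320 |
| | Percentage | 15.8% | 19.4% | 19.2% | 10.2% | 16.3% | 31.1% | 30.3% | 32.5% | 31.4% | 23.0% |
| Other | Count | 35 | 133 | 88 | 43 | 95 | 414 | 349 | 305 | 339 | 1 801 |
| | Percentage | 3.8% | 7.5% | 5.1% | 2.4% | 5.4% | 25.4% | 21.9% | 19.0% | 20.6% | 12.5% |
| Total | Count | 916 | 1 764 | 1 737 | 1 771 | 1 768 | 1 633 | 1 594 | 1 605 | 1 646 | 14 434 |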
Distribution of Alignment Types in the Manually Aligned Corpus
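The percentages in the distribution table are each type's share of a column total. The short Python sketch below shows one way such a tally could be produced from a list of alignment beads and checks the arithmetic against the Berg column. The bead representation (a pair of index tuples) and the function names are assumptions for illustration, not the authors' tooling.

```python
from collections import Counter

def bead_type(bead):
    """Map a bead (source indices, target indices) to a row label of the table."""
    shape = (len(bead[0]), len(bead[1]))
    if shape in {(1, 0), (0, 1)}:
        return "1:0 & 0:1"
    if shape == (1, 1):
        return "1:1"
    if shape in {(1, 2), (2, 1)}:
        return "1:2 & 2:1"
    return "other"

def tally(beads):
    """Count how many beads fall into each alignment type."""
    return Counter(bead_type(bead) for bead in beads)

def distribution(counts):
    """Convert raw counts to percentages of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * count / total, 1) for label, count in counts.items()}

# A hypothetical hand-made alignment: one 1:1, one 1:2, one 0:1 and one 2:2 bead.
beads = [((0,), (0,)), ((1,), (1, 2)), ((), (3,)), ((2, 3), (4, 5))]
print(tally(beads))

# Counts taken from the Berg (German-French) column of the table.
berg = {"1:0 & 0:1": 58, "1:1": 678, "1:2 & 2:1": 145, "other": 35}
print(distribution(berg))
```

Running `distribution` on the Berg counts reproduces the published percentages for that column (6.3, 74.0, 15.8 and 3.8).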
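| Evaluation set | Corpus source | Source language | Source sentences | Source words/characters | Target language | Target sentences | Target words/characters |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Sub-Train | iQIYI videos | Chinese | 183 078 | 746 126 | Vietnamese | 204 494 | 962 833 |
| Sub-Dev | iQIYI videos | Chinese | 2 529 | 11 176 | Vietnamese | 2 529 | 13 395 |
| Sub-Test | iQIYI videos | Chinese | 2 856 | 11 740 | Vietnamese | 2 856 | 13 878 |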
An Overview of Extrinsic Evaluation Dataset
Experimental Design
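| System | Metric | Berg German-French | Gov Chinese-German | Gov Chinese-French | Gov Chinese-Japanese | Gov Chinese-English | Gov average | TB Chinese-German | TB Chinese-French | TB Chinese-Italian | TB Chinese-English | TB average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gale-Church | P | 0.672 | 0.682 | 0.764 | 0.891 | 0.736 | 0.768 | 0.177 | 0.317 | 0.327 | 0.311 | 0.283 |
| | R | 0.685 | 0.694 | 0.770 | 0.892 | 0.741 | 0.774 | 0.204 | 0.352 | 0.363 | 0.343 | 0.316 |
| | F1 | 0.678 | 0.688 | 0.767 | 0.891 | 0.739 | 0.771 | 0.190 | 0.334 | 0.344 | 0.326 | 0.299 |
| Bleualign | P | 0.851 | 0.791 | 0.848 | 0.914 | 0.848 | 0.852 | 0.502 | 0.596 | 0.618 | 0.688 | 0.601 |
| | R | 0.816 | 0.715 | 0.801 | 0.887 | 0.803 | 0.802 | 0.380 | 0.530 | 0.563 | 0.638 | 0.528 |
| | F1 | 0.833 | 0.751 | 0.824 | 0.900 | 0.825 | 0.826 | 0.433 | 0.561 | 0.589 | 0.662 | 0.561 |
| Hunalign | P | 0.798 | 0.789 | 0.843 | 0.900 | 0.844 | 0.844 | 0.464 | 0.538 | 0.563 | 0.620 | 0.546 |
| | R | 0.847 | 0.849 | 0.886 | 0.939 | 0.881 | 0.889 | 0.583 | 0.646 | 0.666 | 0.710 | 0.651 |
| | F1 | 0.822 | 0.818 | 0.864 | 0.919 | 0.862 | 0.866 | 0.517 | 0.587 | 0.610 | 0.662 | 0.594 |
| Vecalign | P | 0.899 | 0.957 | 0.962 | 0.977 | 0.957 | 0.963 | 0.855 | 0.867 | 0.872 | 0.919 | 0.878 |
| | R | 0.904 | 0.961 | 0.966 | 0.972 | 0.957 | 0.964 | 0.886 | 0.898 | 0.902 | 0.939 | 0.906 |
| | F1 | 0.902 | 0.959 | 0.964 | 0.975 | 0.957 | 0.964 | 0.870 | 0.882 | 0.887 | 0.929 | 0.892 |
| LingAlign | P | 0.941 | 0.975 | 0.982 | 0.992 | 0.974 | 0.981 | 0.910 | 0.913 | 0.915 | 0.933 | 0.918 |
| | R | 0.943 | 0.979 | 0.982 | 0.994 | 0.974 | 0.982 | 0.926 | 0.937 | 0.939 | 0.951 | 0.938 |
| | F1 | 0.942 | 0.977 | 0.982 | 0.993 | 0.974 | 0.982 | 0.918 | 0.925 | 0.927 | 0.942 | 0.928 |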
Intrinsic Evaluation Results
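For readers unfamiliar with how P, R and F1 are obtained for sentence alignment, the following Python sketch scores a predicted alignment against a gold one. It assumes strict scoring, where a predicted bead counts as correct only if it matches a gold bead exactly; this excerpt does not state which scoring convention the paper uses, and the bead representation and example data are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def score(predicted, gold):
    """Strict precision, recall and F1 over alignment beads.

    Each bead is a pair (source indices, target indices), e.g. ((1,), (1, 2)).
    """
    pred, ref = set(predicted), set(gold)
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall, f1_score(precision, recall)

# Hypothetical example: the aligner splits a gold 1:2 bead into 1:1 plus 0:1.
gold = [((0,), (0,)), ((1,), (1, 2)), ((2,), (3,))]
predicted = [((0,), (0,)), ((1,), (1,)), ((), (2,)), ((2,), (3,))]
precision, recall, f1 = score(predicted, gold)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")

# Consistency check on the table: LingAlign's TB averages, P=0.918 and R=0.938.
print(round(f1_score(0.918, 0.938), 3))
```

In the example, two of the four predicted beads and two of the three gold beads match, which gives a precision of one half and a recall of two thirds. The final line shows that the harmonic mean of LingAlign's average P and R on TB agrees with the reported average F1 of 0.928.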
| Error type | Source | Translation |
| --- | --- | --- |
| Addition or omission in translation | s1: “在运河两岸立两根柱子,柱子之间平行地扯上许多细丝,间距半米左右,这些细丝是汪教授他们制造出来的那种叫‘飞刃’的纳米材料。” | t1: “We set up two pillars on the shores of the canal, and then between them we string many parallel, thin filaments, about half a meter apart. |
| | | t2: The filaments should be made from the nanomaterial called ‘Flying Blade,’ developed by Professor Wang. |
| | | t3: A very appropriate name, in this case.” |
| Crossing alignment | s1: “哈哈哈,又放倒了一个!” | t1: Wang’s cries were interrupted by laughter. |
| | s2: 汪淼的哭泣被身后的一阵笑声打断,他扭头一看,大史站在那里,嘴里吐出一口白烟。 | t2: Hahaha, another one bites the dust! |
| | | t3: He turned around. |
| | | t4: Captain Shi Qiang stood there, blowing out a mouthful of white smoke. |
Error Analysis of Automatic Alignments
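| System | Sentence-pair types in training data | Sentence pairs | chrF | chrF++ | COMET |
| --- | --- | --- | --- | --- | --- |
| Gale-Church | 1:1 | 156 598 | 53.61 | 53.88 | 86.45 |
| | 1:1 + 1:2 + 2:1 | 181 377 | 54.84 | 55.03 | 86.67 |
| | 1:1 + 1:2 + 2:1 + other types | 181 386 | 54.48 | 54.68 | 86.68 |
| Bleualign | 1:1 | 148 635 | 54.76 | 54.97 | 86.69 |
| | 1:1 + 1:2 + 2:1 | 162 337 | 54.84 | 55.02 | 86.96 |
| | 1:1 + 1:2 + 2:1 + other types | 163 236 | 54.75 | 54.97 | 86.94 |
| Hunalign | 1:1 | 166 843 | 54.97 | 55.19 | 87.18 |
| | 1:1 + 1:2 + 2:1 | 178 387 | 55.10 | 55.33 | 87.03 |
| | 1:1 + 1:2 + 2:1 + other types | 178 803 | 55.03 | 55.25 | 87.09 |
| Vecalign | 1:1 | 154 337 | 54.73 | 54.96 | 87.09 |
| | 1:1 + 1:2 + 2:1 | 172 414 | 55.05 | 55.27 | 87.16 |
| | 1:1 + 1:2 + 2:1 + other types | 176 998 | 55.35 | 55.53 | 87.17 |
| LingAlign | 1:1 | 155 558 | 55.12 | 55.34 | 87.16 |
| | 1:1 + 1:2 + 2:1 | 170 067 | 55.42 | 55.65 | 87.19 |
| | 1:1 + 1:2 + 2:1 + other types | 176 857 | 55.65 | 55.85 | 87.31 |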
Extrinsic Evaluation Results
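To give a sense of what the chrF column measures, here is a simplified Python sketch of a character n-gram F-score. It follows the general idea of chrF (character n-grams up to order 6, with recall weighted more heavily than precision through beta = 2) but is an illustrative assumption, not the evaluation tool used in the paper, and it will not reproduce the reported scores exactly. chrF++ additionally includes word n-grams, which the sketch omits. The function names and example strings are hypothetical.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF on a 0-100 scale for one hypothesis/reference pair."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to have n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# Hypothetical system output scored against a reference translation.
reference = "She took the stack of photocopies and left the office."
hypothesis = "She took the pile of photocopies and left the office."
print(round(chrf(hypothesis, reference), 2))
```

Because the score is built from character n-grams, a near-synonym substitution such as "pile" for "stack" lowers it only partially, which is one reason character-level metrics are often preferred for morphologically rich target languages.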
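| System | Language pair | Alignment time (s) | Number of parallel sentence pairs |
| --- | --- | --- | --- |
| Gale-Church | Chinese-Vietnamese | 25.4 | 181 386 |
| Bleualign | Chinese-Vietnamese | 274.6 | 163 236 |
| Hunalign | Chinese-Vietnamese | 26.3 | 178 803 |
| Vecalign | Chinese-Vietnamese | 142.6 | 176 998 |
| LingAlign | Chinese-Vietnamese | 38.7 | 176 857 |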
Running Speeds of Different Sentence Aligners
