[Objective] This paper develops a multilingual sentence aligner to support parallel-corpus-based research in digital humanities and machine translation. [Methods] The system first encodes both sides of the bitext in a shared vector space, then measures the semantic similarity between sentences with a modified cosine similarity. Finally, a two-stage dynamic programming algorithm automatically extracts parallel sentence pairs. [Results] We assess the system with both intrinsic and extrinsic evaluation. In the intrinsic evaluation, the average precision, recall and F1 reach 0.950, 0.960 and 0.955, respectively. In the extrinsic evaluation, the chrF, chrF++ and COMET scores are 55.65, 55.85 and 87.31, respectively. [Limitations] A data collection platform that integrates document alignment and sentence alignment has yet to be developed. [Conclusions] The proposed approach outperforms existing methods in both intrinsic and extrinsic evaluation, which may help promote the construction of large, high-quality multilingual parallel corpora.
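The pipeline summarized in [Methods] can be illustrated with a minimal sketch. It is not the system described in this paper: hand-written toy vectors stand in for the multilingual sentence encoder, plain cosine similarity replaces the modified cosine similarity, and a single-pass monotonic dynamic program that only produces 1-1 links (with skips) replaces the two-stage algorithm. The names `cosine_matrix`, `align` and the threshold `min_sim` are hypothetical and chosen for illustration only.

```python
import numpy as np


def cosine_matrix(src, tgt):
    """Pairwise cosine similarity between the rows of src and the rows of tgt."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return src @ tgt.T


def align(sim, min_sim=0.5):
    """Monotonic 1-1 alignment with skips, found by dynamic programming.

    score[i, j] is the best total gain for the first i source and the first
    j target sentences. Pairing (i, j) gains sim - min_sim; skipping a
    sentence gains nothing, so pairs below min_sim are never worth linking.
    """
    n, m = sim.shape
    score = np.zeros((n + 1, m + 1))
    # Back-pointers: 0 = pair, 1 = skip source sentence, 2 = skip target sentence.
    back = np.zeros((n + 1, m + 1), dtype=int)
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0 and j > 0:
                gain = sim[i - 1, j - 1] - min_sim
                options.append((score[i - 1, j - 1] + gain, 0))
            if i > 0:
                options.append((score[i - 1, j], 1))
            if j > 0:
                options.append((score[i, j - 1], 2))
            score[i, j], back[i, j] = max(options)

    # Follow the back-pointers from the bottom-right corner to recover the path.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i, j]
        if move == 0:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == 1:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


# Toy "embeddings": three source sentences, two target sentences.
# The second source sentence has no counterpart on the target side.
src = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
tgt = np.array([[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]])
print(align(cosine_matrix(src, tgt)))  # [(0, 0), (2, 1)]
```

In this toy case the unmatched source sentence is skipped rather than forced into a link, because its similarity to either target sentence falls below `min_sim`. Handling 1-many and many-1 links, which sentence splitting and joining in translation require, would need additional transition types that this sketch omits.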
t5: Only mantou buns and pickles were left, and the cafeteria workers told her impatiently that they were closing.
t6: So she had no choice but to carry her lunch box outside and walk next to the lip of the cliff, where she sat down on the grass to chew the cold mantou.