Chinese-English Sentence Alignment of Ancient Literature Based on Multi-feature Fusion
Liang Jiwen1, Jiang Chuan2, Wang Dongbo2,3
1 School of Information Management, Nanjing University, Nanjing 210023, China
2 College of Information Science & Technology, Nanjing Agricultural University, Nanjing 210095, China
3 Facultair Onderzoekscentrum ECOOM, KU Leuven, Leuven B-3000, Belgium
[Objective] This paper proposes a method for automatically aligning Chinese sentences from Pre-Qin literature with their English translations, aiming to construct a bilingual sentence-level parallel corpus and support cross-language retrieval. [Methods] First, we adapted a classification method for parallel sentence pairs to the alignment of bilingual sentences from historical literature. Second, we extracted features of bilingual sentence pairs based on the characteristics of the bilingual corpus. Finally, we identified aligned pairs among the candidate sentences using two approaches, “sequence labeling” and “overall classification”. [Results] In the sequence labeling experiment, the LSTM-CRF model performed best, with an F value of 92.67%. In the overall classification experiment, the SVM performed best, with an F value of 90.63%. In the experiment combining all four features, the F value was 91.01%. [Limitations] The corpus size needs to be expanded. [Conclusions] The LSTM-CRF model with four features can effectively align ancient Chinese sentences with their English translations.
Liang Jiwen, Jiang Chuan, Wang Dongbo. Chinese-English Sentence Alignment of Ancient Literature Based on Multi-feature Fusion[J]. Data Analysis and Knowledge Discovery, 2020, 4(9): 123-132.