Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (9): 123-132    DOI: 10.11925/infotech.2096-3467.2019.0268
Chinese-English Sentence Alignment of Ancient Literature Based on Multi-feature Fusion
Liang Jiwen 1, Jiang Chuan 2, Wang Dongbo 2,3
1 School of Information Management, Nanjing University, Nanjing 210023, China
2 College of Information Science & Technology, Nanjing Agricultural University, Nanjing 210095, China
3 Facultair Onderzoekscentrum ECOOM, KU Leuven, Leuven B-3000, Belgium
[Objective] This paper proposes a method for automatically aligning Chinese sentences from Pre-Qin literature with their English translations, aiming to construct a sentence-level bilingual parallel corpus and support cross-language retrieval.
[Methods] First, we adapted a classification method for parallel sentence pairs to align bilingual sentences from historical literature. Then, based on the characteristics of the bilingual corpus, we extracted features of bilingual sentence pairs. Finally, using “sequence labeling” and “overall classification” approaches, we identified aligned pairs among the candidate sentences.
[Results] In the sequence labeling experiment, the LSTM-CRF model performed best, with an F value of 92.67%. In the overall classification experiment, the SVM performed best, with an F value of 90.63%. In the experiment combining all four features, the F value was 91.01%.
[Limitations] The corpus size needs to be expanded.
[Conclusions] The LSTM-CRF model with four features can effectively align ancient Chinese sentences with their English translations.

Key words: Sentence Alignment; Multilingual Information Processing; Chinese-English Parallel Corpus; Pre-Qin Literature; Digital Humanities
Received: 11 March 2019      Published: 17 June 2020
ZTFLH:  G351  
Corresponding Author: Wang Dongbo

Liang Jiwen, Jiang Chuan, Wang Dongbo. Chinese-English Sentence Alignment of Ancient Literature Based on Multi-feature Fusion. Data Analysis and Knowledge Discovery, 2020, 4(9): 123-132.

LSTM-CRF Model Sample
Bilingual Length Distribution of Aligned Sentence Pairs
Relational Distribution of the Length of Two Sentences of Aligned Sentence Pairs
Model    P        R        F        Kappa
SVM      90.19%   90.02%   90.63%   80.02%
MaxEnt   98.57%   76.01%   85.83%   73.59%
MLP      88.18%   89.25%   88.71%   76.19%
Experimental Results of the Overall Classification Model
Model      P        R        F
LSTM       84.07%   93.64%   88.60%
LSTM-CRF   97.35%   88.42%   92.67%
GRU        81.68%   78.65%   80.61%
GRU-CRF    90.35%   89.60%   89.97%
Experimental Results of Sequence Labeling Model
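The page does not spell out how the F value is computed from precision (P) and recall (R). Assuming it is the balanced F-measure (F1), the harmonic mean of P and R, the LSTM-CRF row can be reproduced with the minimal sketch below. The function name `f_value` is hypothetical and introduced only for illustration.

```python
def f_value(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the reported F value is F1; the source does not state the formula.
    Inputs and output are fractions in [0, 1].
    """
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# P and R for the LSTM-CRF row of the sequence labeling table.
lstm_crf_f = f_value(0.9735, 0.8842)
print(f"LSTM-CRF F value: {lstm_crf_f * 100:.2f}%")
```

Under this assumption the computed value rounds to the 92.67% reported for LSTM-CRF.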
Performance of Feature Combination Sentence Alignment
Experimental Results Between Feature Combination and Single Feature
