Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (9): 123-132    DOI: 10.11925/infotech.2096-3467.2019.0268
Chinese-English Sentence Alignment of Ancient Literature Based on Multi-feature Fusion
Liang Jiwen1, Jiang Chuan2, Wang Dongbo2,3
1School of Information Management, Nanjing University, Nanjing 210023, China
2College of Information Science & Technology, Nanjing Agricultural University, Nanjing 210095, China
3Facultair Onderzoekscentrum ECOOM, KU Leuven, Leuven B-3000, Belgium
Abstract  

[Objective] This paper proposes a method for automatically aligning Chinese sentences from Pre-Qin literature with their English translations, aiming to construct a sentence-level bilingual parallel corpus and support cross-language retrieval. [Methods] First, we adapted a classification method for parallel sentence pairs to align bilingual sentences from historical literature. Second, we extracted features of bilingual sentence pairs based on the characteristics of the bilingual corpus. Finally, using “sequence labeling” and “overall classification” approaches, we identified aligned pairs among the candidate sentences. [Results] In the sequence labeling experiment, the LSTM-CRF model performed best, with an F value of 92.67%. In the overall classification experiment, the SVM performed best, with an F value of 90.63%. In the experiment combining all four features, the F value was 91.01%. [Limitations] The corpus needs to be expanded. [Conclusions] The LSTM-CRF model with four features can effectively align ancient Chinese sentences with their English translations.

Key words: Sentence Alignment; Multilingual Information Processing; Chinese-English Parallel Corpus; Pre-Qin Literature; Digital Humanities
Received: 11 March 2019      Published: 17 June 2020
ZTFLH:  G351  
Corresponding Authors: Wang Dongbo     E-mail: db.wang@njau.edu.cn

Cite this article:

Liang Jiwen, Jiang Chuan, Wang Dongbo. Chinese-English Sentence Alignment of Ancient Literature Based on Multi-feature Fusion. Data Analysis and Knowledge Discovery, 2020, 4(9): 123-132.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.0268     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I9/123

LSTM-CRF Model Sample
Bilingual Length Distribution of Aligned Sentence Pairs
Relational Distribution of the Length of Two Sentences of Aligned Sentence Pairs
Model    P        R        F        Kappa
SVM      90.19%   90.02%   90.63%   80.02%
MaxEnt   98.57%   76.01%   85.83%   73.59%
MLP      88.18%   89.25%   88.71%   76.19%
Experimental Results of the Overall Classification Model
Model      P        R        F
LSTM       84.07%   93.64%   88.60%
LSTM-CRF   97.35%   88.42%   92.67%
GRU        81.68%   78.65%   80.61%
GRU-CRF    90.35%   89.60%   89.97%
Experimental Results of Sequence Labeling Model
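The tables report P, R and F without defining the F value. A minimal sketch, assuming F is the balanced F1 score (the harmonic mean of precision and recall), which this page does not confirm: the helper `f_measure` and the `reported` dictionary are illustrative names, and the numbers are the precision and recall values copied from the sequence labeling table. Recomputed values may differ slightly from the reported ones because of rounding or if the authors averaged differently.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) copied from the sequence labeling table.
reported = {
    "LSTM": (84.07, 93.64),
    "LSTM-CRF": (97.35, 88.42),
    "GRU": (81.68, 78.65),
    "GRU-CRF": (90.35, 89.60),
}

# Recompute F for each model, rounded to two decimals like the table.
recomputed = {
    name: round(f_measure(p, r), 2) for name, (p, r) in reported.items()
}

for name, f in recomputed.items():
    print(f"{name}: F = {f:.2f}%")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model with high precision but low recall (or the reverse) scores lower than one with balanced values.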
Performance of Feature Combination Sentence Alignment
Experimental Results Between Feature Combination and Single Feature