Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (9): 123-132     https://doi.org/10.11925/infotech.2096-3467.2019.0268
Research Paper
Chinese-English Sentence Alignment of Ancient Literature Based on Multi-feature Fusion*
Liang Jiwen1, Jiang Chuan2, Wang Dongbo2,3
1School of Information Management, Nanjing University, Nanjing 210023, China
2College of Information Science & Technology, Nanjing Agricultural University, Nanjing 210095, China
3Facultair Onderzoekscentrum ECOOM, KU Leuven, Leuven B-3000, Belgium
Abstract

[Objective] This paper proposes a method for automatically aligning Classical Chinese sentences from Pre-Qin literature with their English translations, in order to support the construction of sentence-level bilingual parallel corpora and cross-language retrieval. [Methods] We treat Chinese-English sentence alignment of the classics as a classification problem over candidate sentence pairs. Features of aligned sentence pairs are selected according to the characteristics of the experimental corpus and existing research, and aligned pairs are then identified among the candidates under two different approaches: “overall classification” and “sequence labeling”. [Results] In the sequence labeling experiment, the LSTM-CRF model achieved the best alignment performance, with an F value of 92.67%. In the overall classification experiment, the SVM performed best, with an F value of 90.63%. In the feature combination experiment, using all four features together gave an F value of 91.01%, outperforming the other feature combinations. [Limitations] The source corpus needs to be supplemented with more diverse types of text. [Conclusions] The LSTM-CRF neural network model that fuses the four features can effectively identify aligned Classical Chinese-English sentence pairs, and thus align bilingual sentences from the classics automatically.

Keywords: Sentence Alignment; Multilingual Information Processing; Chinese-English Parallel Corpus; Pre-Qin Literature; Digital Humanities
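The idea of treating alignment as classification over candidate sentence pairs can be illustrated with a minimal sketch. This page does not name the four features the authors used, so the two features below (a length ratio and a relative-position gap), the `CandidatePair` type, the `build_candidates` function, and the `window` parameter are all hypothetical choices made for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CandidatePair:
    zh_index: int        # position of the Chinese sentence in its passage
    en_index: int        # position of the English sentence in its passage
    length_ratio: float  # English word count / Chinese character count (assumed feature)
    position_gap: float  # absolute difference of relative positions (assumed feature)


def build_candidates(zh: List[str], en: List[str], window: int = 1) -> List[CandidatePair]:
    """Pair each Chinese sentence with English sentences at nearby positions."""
    pairs: List[CandidatePair] = []
    for i, zh_sent in enumerate(zh):
        if not zh_sent:
            continue  # skip empty sentences to avoid dividing by zero
        # Project position i onto the English side, then look `window` sentences around it.
        center = round(i * len(en) / len(zh))
        for j in range(max(0, center - window), min(len(en), center + window + 1)):
            pairs.append(CandidatePair(
                zh_index=i,
                en_index=j,
                length_ratio=len(en[j].split()) / len(zh_sent),
                position_gap=abs(i / len(zh) - j / len(en)),
            ))
    return pairs
```

Each candidate's feature values would then be passed to a classifier (for example an SVM) or arranged as a sequence for a labeling model, which decides whether the pair is aligned.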
Received: 2019-03-11      Published: 2020-06-17
Chinese Library Classification (ZTFLH):  G351
Funding: *This work is one of the research outcomes of the National Natural Science Foundation of China General Program project “Construction of a Syntax-level Chinese-English Parallel Corpus Based on Indexes of the Classics and Research on Humanities Computing” (71673143).
Corresponding author: Wang Dongbo     E-mail: db.wang@njau.edu.cn
Cite this article:
Liang Jiwen, Jiang Chuan, Wang Dongbo. Chinese-English Sentence Alignment of Ancient Literature Based on Multi-feature Fusion[J]. Data Analysis and Knowledge Discovery, 2020, 4(9): 123-132.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.0268      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I9/123
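Fig.1  Example of the LSTM-CRF model
Fig.2  Distribution of Chinese and English sentence lengths in aligned sentence pairs
Fig.3  Distribution of the relationship between Chinese and English sentence lengths in aligned sentence pairs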
Table 1  Sentence alignment results of the overall classification models
Model     P         R         F         Kappa
SVM       90.19%    90.02%    90.63%    80.02%
MaxEnt    98.57%    76.01%    85.83%    73.59%
MLP       88.18%    89.25%    88.71%    76.19%
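This page does not state how the Kappa column was computed. Assuming it is Cohen's kappa for the binary aligned / not-aligned decision, a minimal sketch of the calculation from confusion-matrix counts might look as follows. The function name `cohen_kappa` and the counts in the usage line are hypothetical and are not taken from the paper.

```python
def cohen_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Cohen's kappa for a binary classifier, from confusion-matrix counts."""
    total = tp + fp + fn + tn
    # Observed agreement: share of pairs where prediction and gold label match.
    observed = (tp + tn) / total
    # Chance agreement: product of the marginal proportions, summed over both classes.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    # Undefined (division by zero) when expected == 1, i.e. only one class ever occurs.
    return (observed - expected) / (1 - expected)


# Illustrative counts only: 80% observed agreement against 50% chance agreement.
print(round(cohen_kappa(tp=40, fp=10, fn=10, tn=40), 2))  # 0.6
```

Kappa corrects raw agreement for chance, which matters when non-aligned candidate pairs greatly outnumber aligned ones.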
Table 2  Experimental results of the sequence labeling models
Model       P         R         F
LSTM        84.07%    93.64%    88.60%
LSTM-CRF    97.35%    88.42%    92.67%
GRU         81.68%    78.65%    80.61%
GRU-CRF     90.35%    89.60%    89.97%
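Assuming the F column is the balanced F-measure, i.e. the harmonic mean of precision (P) and recall (R), it can be recomputed from the other two columns. The sketch below uses a hypothetical helper, `f_measure`, and checks it against the LSTM-CRF row.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0  # conventional value when both components are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# LSTM-CRF row: P = 97.35, R = 88.42 (values in percent).
print(round(f_measure(97.35, 88.42), 2))  # 92.67
```

Because the harmonic mean is pulled toward the smaller of the two values, a model with very high precision but low recall still receives a modest F value.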
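Fig.4  Comparison of overall sentence alignment performance across feature combinations
Fig.5  Comparison of experimental results for feature combinations and single features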