Please wait a minute...
New Technology of Library and Information Service  2014, Vol. 30 Issue (12): 36-43    DOI: 10.11925/infotech.1003-3513.2014.12.05
Current Issue | Archive | Adv Search |
Automatic Acquisition of Domain Parallel Corpora from Internet
Shao Jian, Zhang Chengzhi
School of Economics & Management, Nanjing University of Science & Technology, Nanjing 210094, China
Export: BibTeX | EndNote (RIS)      

[Objective] To automatically obtain domain parallel corpora via classified bilingual corpora and sentence alignment. [Methods] Classify bilingual corpora based on text classification technology, use sentence alignment tool to align classified bilingual corpus based on length information of bilingual sentence and bilingual dictionary. This paper uses artificial aligned bilingual corpora to calculate length parameters. [Results] The results obtain 95.45% rate of sentence aligned correctly. The length mean is 1.7777 and variance is 1.2640. [Limitations] Due to the extent of the initial alignment of bilingual corpus is satisfied, so the result of alignment is not universally representative. [Conclusions] The result proves the method presented in this paper is effective, so this method can acquire high quality domain parallel corpora.

Key wordsSentence alignment      Text classification      Parallel corpora      Machine translation     
Received: 30 June 2014      Published: 20 January 2015
:  TP361  

Cite this article:

Shao Jian, Zhang Chengzhi. Automatic Acquisition of Domain Parallel Corpora from Internet. New Technology of Library and Information Service, 2014, 30(12): 36-43.

URL:     OR

[1] Koehn P. Europarl: A Parallel Corpus for Statistical Machine Translation [C]. In: Proceedings of the 10th Machine Translation Summit, Phuket, Thailand. 2005: 79-86.
[2] 吴琳, 魏星, 霍翠婷. 基于 Web 的专利双语语料自动获取研究及实现——以esp@cenet数据库为例[J]. 现代图书情报技术, 2009(9): 57-63. (Wu Lin, Wei Xing, Huo Cuiting. Research and Implement of Automatic Patent Bilingual Corpus Extraction from Web——Taking esp@cenet as an Example [J]. New Technology of Library and Information Service, 2009(9): 57-63. )
[3] Resnik P, Smith N A. The Web as a Parallel Corpus [J]. Computational Linguistics, 2003, 29(3): 349-380.
[4] Ma X, Liberman M Y. BITS: A Method for Bilingual Text Search over the Web [C]. In: Proceedings of Machine Translation Summit VII, Singapore. 1999.
[5] Chen J, Nie J. Automatic Construction of Parallel English-Chinese Corpus for Cross-Language Information Retrieval [C]. In: Proceedings of the 6th Applied Natural Language Processing Conference, Seattle, Washington, USA. 2000: 21-28.
[6] Zhang Y, Wu K, Gao J, et al. Automatic Acquisition of Chinese-English Parallel Corpus from the Web [C]. In: Proceedings of the 28th European Conference on IR Research, London, UK. Springer Berlin Heidelberg, 2006: 420-431.
[7] Zhang C Z, Yao X C, Kit C. Finding More Bilingual Web Pages with High Credibility via Link Analysis [C]. In: Proceedings of the 6th Workshop on Building and Using Comparable Corpora, Sofia, Bulgaria. 2013.
[8] 刘奇, 刘洋, 孙茂松. URL 模式与 HTML 结构相结合的平行网页获取方法 [J]. 中文信息学报, 2013, 27(3): 91-99. (Liu Qi, Liu Yang, Sun Maosong. A Parallel Pages Mining Approach: Comibing URL Patterns and HTML Structures [J]. Journal of Chinese Information Processing, 2013, 27(3): 91-99.)
[9] Gale W A, Church K W. A Program for Aligning Sentences in Bilingual Corpora [J]. Computational Linguistics, 1993, 19(1): 75-102.
[10] Brown P F, Lai J C, Mercer R L. Aligning Sentences in Parallel Corpora [C]. In: Proceedings of the 29th Annual Meeting on Association for Computational Linguistics, 1991: 169-176.
[11] Kay M, Röscheisen M. Text-translation Alignment [J]. Computational Linguistics, 1993, 19(1): 121-142.
[12] Chen S F. Aligning Sentences in Bilingual Corpora Using Lexical Information [C]. In: Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, 1993.
[13] Church K W. Char_align: A Program for Aligning Parallel Texts at the Character Level [C]. In: Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, 1993.
[14] Wu D. Aligning a Parallel English-Chinese Corpus Statistically with Lexical Criteria [C]. In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 1994: 80-87.
[15] Moore R C. Fast and Accurate Sentence Alignment of Bilingual Corpora [C]. In: Proceedings of the 5th Conference of the Association for Machine Translation in the Americas, Tiburon, CA, USA. 2002: 135-144.
[16] Fattah M A, Bracewell D B, Ren F, et al. Sentence Alignment Using P-NNT and GMM [J]. Computer Speech & Language, 2007,21(4): 594-608.
[17] Sennrich R, Volk M. MT-based Sentence Alignment for Ocr-generated Parallel Texts [C]. In: Proceedings of the 9th Conference of the Association for Machine Translation in the Americas (AMTA 2010), Denver, Colorado, USA. 2010.
[18] Trieu H L, Nguyen P T, Nguyen K A. Improving Moore's Sentence Alignment Method Using Bilingual Word Clustering [C]. In: Proceedings of the 5th International Conference on Knowledge and Systems Engineering. Springer International Publishing, 2014: 149-160.
[19] 熊文新. 英汉环保领域平行语料的句对齐与再对齐[J]. 现代图书情报技术, 2013(6): 36-41. (Xiong Wenxin. Sentence Alignment and Re-Alignment for Environmental Protection Texts in English-Chinese Parallel Corpus [J]. New Technology of Library and Information Service, 2013(6): 36-41.)
[20] Vapnik V N. The Nature of Statistical Learning Theory[M]. Springer New York, 2000.
[21] Forman G. An Extensive Empirical Study of Feature Selection Metrics for Text Classification [J]. Journal of Machine Learning Research, 2003, 3: 1289-1305.
[22] Mesleh A M A. Chi Square Feature Extraction Based SVMs Arabic Language Text Categorization System [J]. Journal of Computer Science, 2007, 3(6): 430-435.
[23] Peng H, Long F, Ding C. Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005, 27(8): 1226-1238.
[24] Robertson S. Understanding Inverse Document Frequency: On Theoretical Arguments for IDF [J]. Journal of Documentation, 2004, 60(5): 503-520.
[25] Varga D, Halácsy P, Kornai A, et al. Parallel Corpora for Medium Density Languages [A].// Recent Advances in Natural Language Processing IV [M]. John Benjamins Publishing Company, 2007: 247-258.
[26] Ma X. Champollion: A Robust Parallel Text Sentence Aligner [C]. In: Proceedings of the 5th International Conference on Language Resources and Evaluation, 2006.
[27] Chang C C, Lin C J. LIBSVM: A Library for Support Vector Machines [J]. ACM Transactions on Intelligent Systems and Technology (TIST), 2011, 2(3): Article No.27.
[28] Stolcke A. SRILM -An Extensible Language Modeling Toolkit [C]. In: Proceedings of the 7th International Conference on Spoken Language Processing, Denver, Colorado, USA. 2002.
[29] Xiao T, Zhu J, Zhang H, et al. NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation [C]. In: Proceedings of the ACL 2012 System Demonstrations, 2012.
[30] Przybocki M A, Peterson K, Bronsart S. Translation Adequacy and Preference Evaluation Tool (TAP-ET) [C]. In: Proceedings of the International Conference on Language Resources and Evaluation, LREC, Marrakech, Morocco. 2008.

[1] Chen Jie,Ma Jing,Li Xiaofeng. Short-Text Classification Method with Text Features from Pre-trained Models[J]. 数据分析与知识发现, 2021, 5(9): 21-30.
[2] Zhou Zeyu,Wang Hao,Zhao Zibo,Li Yueyan,Zhang Xiaoqin. Construction and Application of GCN Model for Text Classification with Associated Information[J]. 数据分析与知识发现, 2021, 5(9): 31-41.
[3] Liu Wenbin, He Yanqing, Wu Zhenfeng, Dong Cheng. Sentence Alignment Method Based on BERT and Multi-similarity Fusion[J]. 数据分析与知识发现, 2021, 5(7): 48-58.
[4] Yu Bengong,Zhu Xiaojie,Zhang Ziwei. A Capsule Network Model for Text Classification with Multi-level Feature Extraction[J]. 数据分析与知识发现, 2021, 5(6): 93-102.
[5] Wang Yan, Wang Huyan, Yu Bengong. Chinese Text Classification with Feature Fusion[J]. 数据分析与知识发现, 2021, 5(10): 1-14.
[6] Liang Jiwen,Jiang Chuan,Wang Dongbo. Chinese-English Sentence Alignment of Ancient Literature Based on Multi-feature Fusion[J]. 数据分析与知识发现, 2020, 4(9): 123-132.
[7] Wang Sidi,Hu Guangwei,Yang Siyu,Shi Yun. Automatic Transferring Government Website E-Mails Based on Text Classification[J]. 数据分析与知识发现, 2020, 4(6): 51-59.
[8] Shi Lei,Wang Yi,Cheng Ying,Wei Ruibin. Review of Attention Mechanism in Natural Language Processing[J]. 数据分析与知识发现, 2020, 4(5): 1-14.
[9] Xu Yuemei,Liu Yunwen,Cai Lianqiao. Predicitng Retweets of Government Microblogs with Deep-combined Features[J]. 数据分析与知识发现, 2020, 4(2/3): 18-28.
[10] Xu Tongtong,Sun Huazhi,Ma Chunmei,Jiang Lifen,Liu Yichen. Classification Model for Few-shot Texts Based on Bi-directional Long-term Attention Features[J]. 数据分析与知识发现, 2020, 4(10): 113-123.
[11] Bengong Yu,Yumeng Cao,Yangnan Chen,Ying Yang. Classification of Short Texts Based on nLD-SVM-RF Model[J]. 数据分析与知识发现, 2020, 4(1): 111-120.
[12] Weimin Nie,Yongzhou Chen,Jing Ma. A Text Vector Representation Model Merging Multi-Granularity Information[J]. 数据分析与知识发现, 2019, 3(9): 45-52.
[13] Yunfei Shao,Dongsu Liu. Classifying Short-texts with Class Feature Extension[J]. 数据分析与知识发现, 2019, 3(9): 60-67.
[14] Heran Qin,Liu Liu,Bin Li,Dongbo Wang. Automatic Classification of Ancient Classics with Entity Features[J]. 数据分析与知识发现, 2019, 3(9): 68-76.
[15] Guo Chen,Tianxiang Xu. Sentence Function Recognition Based on Active Learning[J]. 数据分析与知识发现, 2019, 3(8): 53-61.
  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938