New Technology of Library and Information Service  2014, Vol. 30 Issue (12): 36-43    DOI: 10.11925/infotech.1003-3513.2014.12.05
Automatic Acquisition of Domain Parallel Corpora from Internet
Shao Jian, Zhang Chengzhi
School of Economics & Management, Nanjing University of Science & Technology, Nanjing 210094, China
Abstract  

[Objective] To automatically acquire domain-specific parallel corpora by classifying bilingual corpora and aligning their sentences. [Methods] Bilingual corpora are classified by domain using text classification techniques. A sentence alignment tool then aligns the classified corpora using bilingual sentence-length information and a bilingual dictionary. Manually aligned bilingual corpora are used to estimate the length parameters. [Results] The sentence alignment accuracy reaches 95.45%. The estimated length mean is 1.7777 and the variance is 1.2640. [Limitations] Because the bilingual corpora used were already well aligned initially, the alignment results may not be universally representative. [Conclusions] The results indicate that the proposed method is effective and can acquire high-quality domain parallel corpora.
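
The length-based part of the alignment step can be illustrated with a minimal sketch. It assumes a Gale-Church-style model [9], in which the reported mean and variance describe the number of target-language characters per source-language character; the abstract does not state this definition, so treat that reading as an assumption. The function names and the simple variance estimator are hypothetical and are not taken from the paper or its alignment tool.

```python
import math


def estimate_length_params(pairs):
    """Estimate (mean, variance) from hand-aligned (source_len, target_len) pairs.

    Assumption: mean c is total target length over total source length, and
    variance s2 is the average of (target - c * source)^2 / source.
    """
    total_src = sum(src for src, _ in pairs)
    total_tgt = sum(tgt for _, tgt in pairs)
    c = total_tgt / total_src
    s2 = sum((tgt - c * src) ** 2 / src for src, tgt in pairs) / len(pairs)
    return c, s2


def length_delta(src_len, tgt_len, c, s2):
    """Normalized length difference, roughly standard normal for true pairs."""
    return (tgt_len - c * src_len) / math.sqrt(src_len * s2)


def match_cost(src_len, tgt_len, c, s2):
    """Negative log of the two-sided normal tail probability of delta.

    Lower cost means the two lengths are more plausible as a translation pair.
    """
    delta = abs(length_delta(src_len, tgt_len, c, s2))
    tail = math.erfc(delta / math.sqrt(2))  # equals 2 * (1 - Phi(delta))
    return -math.log(max(tail, 1e-300))     # clamp to avoid log(0)


# Parameters as reported in the abstract (their exact definition is assumed).
C_MEAN, S2_VAR = 1.7777, 1.2640

plausible = match_cost(10, 18, C_MEAN, S2_VAR)
implausible = match_cost(10, 40, C_MEAN, S2_VAR)
```

A full aligner would combine this cost with dictionary evidence and search over 1-1, 1-2, 2-1 and similar groupings with dynamic programming; the sketch covers only the length score for a single candidate pair.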

Key words: Sentence alignment; Text classification; Parallel corpora; Machine translation
Received: 30 June 2014      Published: 20 January 2015
CLC number: TP361

Cite this article:

Shao Jian, Zhang Chengzhi. Automatic Acquisition of Domain Parallel Corpora from Internet. New Technology of Library and Information Service, 2014, 30(12): 36-43.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2014.12.05     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2014/V30/I12/36

