New Technology of Library and Information Service  2012, Vol. 28 Issue (2): 28-33    DOI: 10.11925/infotech.1003-3513.2012.02.05
Chinese-English Comparable Corpus Construction for Bilingual Terminology Extraction
Kang Xiaoli¹, Zhang Chengzhi²
1. Library of Nanchang University, Nanchang 330031, China;
2. Department of Information Management, Nanjing University of Science and Technology, Nanjing 210094, China
Abstract  This paper designs a process for building a domain-specific comparable corpus for bilingual terminology extraction. First, a bilingual sample corpus in the target domain is collected, and keywords are extracted from it using a word co-occurrence method. Next, these keywords are submitted as queries to a scholarly search engine, and the retrieved results form the candidate comparable corpus. Finally, noise documents are filtered out by quantitative evaluation to obtain the domain-specific comparable corpus.
Key words: Comparable corpus; Corpus construction; Bilingual terminology extraction
Received: 04 January 2012      Published: 23 March 2012
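
The abstract does not give the exact scoring or filtering formulas, so the following is only a minimal sketch of two of the steps under stated assumptions: keywords are ranked by how often they co-occur with other words in the same sentence, and candidate documents are kept when the cosine similarity between their term counts and the sample corpus exceeds a threshold. The function names, the `top_k` and `threshold` parameters, and the choice of cosine similarity are illustrative assumptions, not the authors' method. Input is assumed to be already tokenized (Chinese text would need word segmentation first), and the search engine query step is left out.

```python
from collections import Counter
from itertools import combinations
import math


def cooccurrence_keywords(sentences, top_k=5):
    """Rank words by the number of sentence-level co-occurrences (assumed scoring)."""
    scores = Counter()
    for tokens in sentences:
        # Every unordered pair of distinct words in a sentence counts once.
        for a, b in combinations(sorted(set(tokens)), 2):
            scores[a] += 1
            scores[b] += 1
    # Sort by score (descending), then alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


def cosine(u, v):
    """Cosine similarity between two term-count mappings."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def filter_candidates(sample_tokens, candidates, threshold=0.2):
    """Keep candidate documents that resemble the sample corpus (assumed criterion)."""
    profile = Counter(sample_tokens)
    return [doc for doc in candidates if cosine(profile, Counter(doc)) >= threshold]


if __name__ == "__main__":
    sample = [
        ["comparable", "corpus", "terminology"],
        ["corpus", "terminology", "extraction"],
        ["bilingual", "corpus"],
    ]
    print(cooccurrence_keywords(sample, top_k=2))
    flat = [token for sentence in sample for token in sentence]
    docs = [["corpus", "terminology", "alignment"], ["football", "match", "score"]]
    print(filter_candidates(flat, docs))
```

In this sketch the threshold trades corpus size against noise: a higher value discards more retrieved documents, so in practice it would need tuning against a manually checked sample.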



Cite this article:

Kang Xiaoli, Zhang Chengzhi. Chinese-English Comparable Corpus Construction for Bilingual Terminology Extraction. New Technology of Library and Information Service, 2012, 28(2): 28-33.

