Chinese-English Comparable Corpus Construction for Bilingual Terminology Extraction

doi:10.11925/infotech.1003-3513.2012.02.05

New Technology of Library and Information Service

2012, Vol. 28, Issue 2: 28-33

Knowledge Organization and Knowledge Management

Kang Xiaoli (康小丽)¹, Zhang Chengzhi (章成志)²

1. Library of Nanchang University, Nanchang 330031, China;
2. Department of Information Management, Nanjing University of Science and Technology, Nanjing 210094, China

Abstract: This paper proposes a scheme for building domain-specific comparable corpora for bilingual terminology extraction and validates it experimentally. For a given subject domain, Chinese and English sample texts are collected separately and keywords are extracted from each; further domain-related keywords are then obtained from word co-occurrence statistics. These keywords serve as queries to an academic search engine, and the retrieved documents form the candidate comparable corpus. Finally, the candidates are evaluated quantitatively and documents that do not meet the requirements are filtered out, yielding a comparable corpus for the given domain.

Keywords: comparable corpus, corpus construction, bilingual terminology extraction
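The abstract names two computational steps, keyword expansion by word co-occurrence and quantitative filtering of candidate documents, without giving their formulas. The Python sketch below is only an illustration under stated assumptions: co-occurrence is counted at document level, and a candidate is scored by the share of domain keywords it contains. The function names, the coverage score, and the 0.5 threshold are hypothetical and are not taken from the paper.

```python
from collections import Counter


def expand_keywords(docs, seeds, top_n=3):
    """Return the top_n non-seed terms that co-occur with any seed keyword.

    docs is a list of token lists. Co-occurrence is counted per document,
    which is an assumption rather than the paper's stated statistic.
    """
    seed_set = set(seeds)
    counts = Counter()
    for tokens in docs:
        vocab = set(tokens)
        hits = len(vocab & seed_set)  # number of seed keywords in this document
        if hits:
            for term in vocab - seed_set:
                counts[term] += hits
    # Sort by count (descending), then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


def coverage(tokens, keywords):
    """Fraction of the domain keywords that appear in one candidate document."""
    keywords = set(keywords)
    if not keywords:
        return 0.0
    return len(keywords & set(tokens)) / len(keywords)


def filter_candidates(candidates, keywords, threshold=0.5):
    """Keep candidates whose keyword coverage reaches the (hypothetical) threshold."""
    return [doc for doc in candidates if coverage(doc, keywords) >= threshold]


# Toy data, invented for illustration only.
sample_docs = [
    ["comparable", "corpus", "alignment", "term"],
    ["corpus", "term", "extraction"],
    ["weather", "forecast"],
]
seeds = ["corpus"]
related = expand_keywords(sample_docs, seeds, top_n=2)
keywords = seeds + related

candidates = [
    ["corpus", "term", "translation"],
    ["football", "match", "report"],
]
kept = filter_candidates(candidates, keywords, threshold=0.5)
```

In a real pipeline the token lists would come from a Chinese word segmenter and an English tokenizer, and the same steps would run once per language. The coverage score is a stand-in for whatever quantitative evaluation the full paper defines.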

Received: 2012-01-04; Published: 2012-03-23

TP391

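Funding:

This work is one of the research results of the National Natural Science Foundation of China project "Research on Multilingual Text Clustering Based on Comparable Corpora" (Grant No. 70903032) and the Nanjing University of Science and Technology Independent Research Special Program project "Research on Multilingual Tag Clustering" (Grant No. 2011ZDJH15).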

Cite this article:

Kang Xiaoli, Zhang Chengzhi. Chinese-English Comparable Corpus Construction for Bilingual Terminology Extraction[J]. New Technology of Library and Information Service, 2012, 28(2): 28-33.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2012.02.05 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2012/V28/I2/28
