Automatic Building of Sentence-Level English-Chinese Parallel Corpus
Wang Dongbo, Su Xinning
(Department of Information Management, Nanjing University, Nanjing 210093, China)
Abstract This article describes how to automatically build a large-scale, sentence-level English-Chinese parallel corpus from websites. The following steps are covered: criteria for selecting websites are defined and a word library is compiled; the websites are downloaded automatically with the tool 'Wget'; English-Chinese parallel sentences are extracted from the downloaded pages and processed, and the Chinese sentences are segmented using a Conditional Random Field model; finally, the sentence pairs are stored in a database. The resulting corpus contains 1 017 963 English-Chinese parallel sentences, extracted automatically from 675 308 websites.
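
The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the table name `sentence_pair`, the adjacency heuristic (an English line immediately followed by a Chinese line), and the use of SQLite in place of a server database are all assumptions made for illustration. The word-library filtering and the CRF-based segmentation step are omitted. The Wget options shown (`--recursive`, `--level`, `--no-parent`, `--accept`, `--wait`, `--directory-prefix`) are standard GNU Wget options, but the values chosen are hypothetical.

```python
import re
import sqlite3

# Basic CJK Unified Ideographs range, used to detect Chinese text.
CJK = re.compile(r"[\u4e00-\u9fff]")


def build_wget_command(url, out_dir, depth=2):
    """Return an argument list for a depth-limited recursive Wget crawl.

    The depth, wait time, and accepted suffixes are illustrative values.
    """
    return [
        "wget",
        "--recursive",
        f"--level={depth}",
        "--no-parent",
        "--accept=html,htm",
        "--wait=1",
        f"--directory-prefix={out_dir}",
        url,
    ]


def is_english(line):
    """True if the line has Latin letters and no Chinese characters."""
    return bool(re.search(r"[A-Za-z]", line)) and not CJK.search(line)


def is_chinese(line):
    """True if the line contains at least one Chinese character."""
    return bool(CJK.search(line))


def extract_pairs(lines):
    """Pair each English line with the Chinese line directly after it.

    This adjacency heuristic is an assumption, not the paper's method.
    """
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    pairs = []
    i = 0
    while i < len(cleaned) - 1:
        if is_english(cleaned[i]) and is_chinese(cleaned[i + 1]):
            pairs.append((cleaned[i], cleaned[i + 1]))
            i += 2  # skip past the consumed pair
        else:
            i += 1
    return pairs


def store_pairs(conn, pairs):
    """Create the (hypothetical) sentence_pair table and insert the pairs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sentence_pair ("
        "id INTEGER PRIMARY KEY, english TEXT NOT NULL, chinese TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO sentence_pair (english, chinese) VALUES (?, ?)", pairs
    )
    conn.commit()
```

In such a sketch, the list returned by `build_wget_command` could be passed to `subprocess.run` to download pages, the text lines of each page fed to `extract_pairs`, and the result written with `store_pairs` to a connection such as `sqlite3.connect(":memory:")`.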
Received: 30 November 2009
Published: 25 December 2009
Corresponding Author: Wang Dongbo
E-mail: jisuanyuyan@163.com
About authors: Wang Dongbo, Su Xinning