New Technology of Library and Information Service  2009, Vol. 25 Issue (12): 47-51    DOI: 10.11925/infotech.1003-3513.2009.12.09
Automatic Building of Sentence-Level English-Chinese Parallel Corpus
Wang Dongbo   Su Xinning
(Department of Information Management, Nanjing University, Nanjing 210093, China)
Abstract  

This article describes the steps for automatically building a large-scale, sentence-level English-Chinese parallel corpus from websites. First, the criteria for selecting websites to crawl are set and a word library is compiled. Second, the websites are crawled automatically with the tool Wget. Third, the English-Chinese parallel sentences extracted from the websites are processed, and the Chinese sentences are segmented with a Conditional Random Field model. The resulting corpus contains 1 017 963 English-Chinese parallel sentences, automatically extracted from 675 308 websites and stored in a database.
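The extraction and storage steps summarized above can be sketched roughly as follows. This is a minimal illustration and not the authors' method: the adjacency heuristic (an English line immediately followed by a Chinese line), the function names `extract_pairs` and `store_pairs`, and the table name `sentence_pair` are all assumptions made for the example, and SQLite stands in for whatever database the authors used. Crawling with Wget and CRF-based word segmentation are left out.

```python
import re
import sqlite3
from typing import List, Tuple

# Assumption: a line counts as Chinese if it contains any CJK ideograph,
# and as English if it contains Latin letters and no CJK ideographs.
CJK = re.compile(r"[\u4e00-\u9fff]")
LATIN = re.compile(r"[A-Za-z]")


def is_chinese(line: str) -> bool:
    return bool(CJK.search(line))


def is_english(line: str) -> bool:
    return bool(LATIN.search(line)) and not CJK.search(line)


def extract_pairs(text: str) -> List[Tuple[str, str]]:
    """Pair each English line with the Chinese line directly after it.

    Hypothetical heuristic for pages that list a sentence and its
    translation on consecutive lines; other lines are skipped.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    pairs = []
    i = 0
    while i < len(lines) - 1:
        if is_english(lines[i]) and is_chinese(lines[i + 1]):
            pairs.append((lines[i], lines[i + 1]))
            i += 2  # both lines consumed
        else:
            i += 1  # not a pair; move on by one line
    return pairs


def store_pairs(conn: sqlite3.Connection, pairs: List[Tuple[str, str]]) -> int:
    """Insert pairs, ignoring exact duplicates; return the total row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sentence_pair ("
        " id INTEGER PRIMARY KEY,"
        " english TEXT NOT NULL,"
        " chinese TEXT NOT NULL,"
        " UNIQUE (english, chinese))"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO sentence_pair (english, chinese) VALUES (?, ?)",
        pairs,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM sentence_pair").fetchone()[0]
```

The uniqueness constraint is one simple way to keep repeated sentences from different pages out of the corpus; a real pipeline would likely need stricter alignment checks than line adjacency.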

Key words: English-Chinese parallel corpus; Wget; Word library; Conditional random field
Received: 30 November 2009      Published: 25 December 2009
CLC number: TP391
Corresponding author: Wang Dongbo (E-mail: jisuanyuyan@163.com)
About authors: Wang Dongbo, Su Xinning

Cite this article:

Wang Dongbo, Su Xinning. Automatic Building of Sentence-Level English-Chinese Parallel Corpus. New Technology of Library and Information Service, 2009, 25(12): 47-51.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2009.12.09     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2009/V25/I12/47

