New Technology of Library and Information Service  2009, Vol. 25 Issue (12): 47-51    DOI: 10.11925/infotech.1003-3513.2009.12.09
Automatic Building of Sentence-Level English-Chinese Parallel Corpus
Wang Dongbo   Su Xinning
(Department of Information Management, Nanjing University, Nanjing 210093, China)

This article describes how to automatically build a large-scale, sentence-level English-Chinese parallel corpus from websites. The process has three stages. First, criteria for selecting websites are set and a word library is compiled. Second, the websites are crawled automatically with the tool Wget. Third, the English-Chinese parallel sentences extracted from the websites are processed, and the Chinese sentences are segmented with a Conditional Random Field model. The resulting corpus contains 1,017,963 English-Chinese parallel sentences, extracted automatically from 675,308 websites and stored in a database.
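The abstract does not give implementation details, so the following is only a minimal sketch of the crawl, extract, and store steps under stated assumptions. The pairing heuristic (an English line immediately followed by a Chinese line), the table name `sentence_pair`, and all function names are hypothetical, and `sqlite3` stands in for the unspecified database so that the example is self-contained. The Wget command is only constructed, not executed.

```python
import re
import sqlite3

# Assumed heuristics, not taken from the article: a line is Chinese if it
# contains a CJK ideograph, and English if it has Latin letters and no CJK.
CJK = re.compile(r"[\u4e00-\u9fff]")
LATIN = re.compile(r"[A-Za-z]")


def build_wget_command(url, out_dir, depth=2):
    """Return an argument list for a recursive Wget crawl (not run here)."""
    return [
        "wget",
        "--recursive",
        f"--level={depth}",
        "--accept=html,htm",
        "--wait=1",
        f"--directory-prefix={out_dir}",
        url,
    ]


def is_english(line):
    return bool(LATIN.search(line)) and not CJK.search(line)


def is_chinese(line):
    return bool(CJK.search(line))


def extract_pairs(lines):
    """Pair each English line with the Chinese line that directly follows it."""
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    pairs = []
    i = 0
    while i < len(cleaned) - 1:
        if is_english(cleaned[i]) and is_chinese(cleaned[i + 1]):
            pairs.append((cleaned[i], cleaned[i + 1]))
            i += 2  # both lines consumed by this pair
        else:
            i += 1
    return pairs


def store_pairs(conn, pairs):
    """Insert pairs into a hypothetical table and return the total row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sentence_pair ("
        "id INTEGER PRIMARY KEY, english TEXT NOT NULL, chinese TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO sentence_pair (english, chinese) VALUES (?, ?)", pairs
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM sentence_pair").fetchone()[0]


# Usage with made-up page text; a real run would read the files Wget saved.
page_text = [
    "Example sentences",
    "I like reading books.",
    "我喜欢读书。",
    "",
    "版权所有",
    "The weather is nice today.",
    "今天天气很好。",
]
pairs = extract_pairs(page_text)
conn = sqlite3.connect(":memory:")
total = store_pairs(conn, pairs)
```

In this sketch a heading such as "Example sentences" is skipped because the line after it is not Chinese. A production pipeline would need stronger alignment checks than adjacency, and the Chinese side would still have to pass through the segmentation step.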

Key words: English-Chinese parallel corpus; Wget; Words library; Conditional random field
Received: 30 November 2009      Published: 25 December 2009


Corresponding author: Wang Dongbo
About the authors: Wang Dongbo, Su Xinning

Cite this article:

Wang Dongbo, Su Xinning. Automatic Building of Sentence-Level English-Chinese Parallel Corpus. New Technology of Library and Information Service, 2009, 25(12): 47-51.


