Research and Implement of Automatic Patent Bilingual Corpus Extraction from Web: Taking esp@cenet as an Example
Wu Lin¹, Wei Xing², Huo Cuiting³
¹ Institute of Scientific & Technical Information of China, Beijing 100038, China
² School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
³ Wanfang Data Co., Ltd., Beijing 100038, China
This paper presents a practical method for automatically extracting high-quality translation pairs from a patent database. It analyzes URL patterns to identify the detail Web pages of patent records for batch downloading, then parses each page and uses regular-expression matching to extract the required information. Finally, it merges the extracted data into a bilingual parallel corpus.
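The pipeline summarized in the abstract (URL construction, regular-expression extraction, merging) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the URL template, the HTML element ids, and the function names are hypothetical, and the actual esp@cenet URL scheme and page markup are not reproduced. The batch download step is left out, so pages are assumed to be already fetched and held as strings keyed by publication number.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical detail-page URL template (assumption, not the real esp@cenet scheme).
DETAIL_URL_TEMPLATE = "https://patents.example.org/detail?pn={pn}&lang={lang}"

# Hypothetical page markup (assumption): title and abstract sit in elements with fixed ids.
TITLE_RE = re.compile(r'<h1 id="title">(.*?)</h1>', re.DOTALL)
ABSTRACT_RE = re.compile(r'<div id="abstract">(.*?)</div>', re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def build_detail_urls(publication_numbers: List[str], lang: str) -> List[str]:
    """Fill the URL template once per publication number, ready for batch download."""
    return [DETAIL_URL_TEMPLATE.format(pn=pn, lang=lang) for pn in publication_numbers]


def clean(fragment: str) -> str:
    """Replace nested tags with spaces and collapse runs of whitespace."""
    return " ".join(TAG_RE.sub(" ", fragment).split())


def extract_fields(html: str) -> Optional[Dict[str, str]]:
    """Pull title and abstract out of one detail page; None if either is missing."""
    title = TITLE_RE.search(html)
    abstract = ABSTRACT_RE.search(html)
    if not (title and abstract):
        return None
    return {"title": clean(title.group(1)), "abstract": clean(abstract.group(1))}


def merge_pairs(
    src_pages: Dict[str, str], tgt_pages: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Align source- and target-language pages by publication number.

    Only patents present in both languages, with both fields extracted,
    contribute (publication number, source text, target text) rows.
    """
    pairs = []
    for pn in sorted(src_pages.keys() & tgt_pages.keys()):
        src = extract_fields(src_pages[pn])
        tgt = extract_fields(tgt_pages[pn])
        if src and tgt:
            pairs.append((pn, src["title"], tgt["title"]))
            pairs.append((pn, src["abstract"], tgt["abstract"]))
    return pairs
```

Keying both page sets by publication number makes the merge a simple set intersection. Records that exist in only one language, or that fail extraction, are dropped and never produce a misaligned pair.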
Wu Lin, Wei Xing, Huo Cuiting. Research and Implement of Automatic Patent Bilingual Corpus Extraction from Web: Taking esp@cenet as an Example [J]. New Technology of Library and Information Service (现代图书情报技术), 2009, (9): 57-63.