New Technology of Library and Information Service  2009, Vol. Issue (9): 57-63    DOI: 10.11925/infotech.1003-3513.2009.09.10
Research and Implement of Automatic Patent Bilingual Corpus Extraction from Web——Taking esp@cenet as an Example
Wu Lin1, Wei Xing2, Huo Cuiting3
1(Institute of Scientific & Technical Information of China, Beijing 100038, China)
2(School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China)
3(Wanfang Data Co.Ltd, Beijing 100038, China)
Abstract  

This paper presents a practical method for automatically extracting high-quality translation pairs from a patent database. It analyzes URL patterns to locate the detail pages of patent records for batch downloading, then parses those pages and applies regular expressions to extract the required fields. Finally, it merges the extracted data into a bilingual parallel corpus.
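
The three steps named in the abstract (building detail-page URLs, extracting fields with regular expressions, and merging records into translation pairs) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the URL template, the query parameter names, the HTML markup matched by the patterns, the English/Chinese language pair, and the function names `build_detail_url`, `extract_fields`, and `merge_pairs`. The downloading step is left out so that the sketch runs without network access.

```python
import re
from urllib.parse import urlencode

# Hypothetical URL template; the real database's URL scheme is not given here.
BASE_URL = "https://patents.example.invalid/detail"


def build_detail_url(pub_number, lang):
    """Build the detail-page URL for one publication number and language."""
    return BASE_URL + "?" + urlencode({"pn": pub_number, "lang": lang})


# Hypothetical page markup: title in <h1 id="title">, abstract in
# <p class="abstract">. Real pages would need their own patterns.
TITLE_RE = re.compile(r'<h1 id="title">(.*?)</h1>', re.S)
ABSTRACT_RE = re.compile(r'<p class="abstract">(.*?)</p>', re.S)


def _clean(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


def extract_fields(html):
    """Return (title, abstract) from one page, or None if either is missing."""
    title = TITLE_RE.search(html)
    abstract = ABSTRACT_RE.search(html)
    if title is None or abstract is None:
        return None
    return _clean(title.group(1)), _clean(abstract.group(1))


def merge_pairs(en_pages, zh_pages):
    """Pair English and Chinese records that share a publication number.

    Both arguments map publication number -> downloaded HTML text.
    Records missing in either language, or missing a field, are skipped.
    """
    pairs = []
    for pub_number, en_html in en_pages.items():
        zh_html = zh_pages.get(pub_number)
        if zh_html is None:
            continue
        en = extract_fields(en_html)
        zh = extract_fields(zh_html)
        if en is None or zh is None:
            continue
        pairs.append({
            "pn": pub_number,
            "title": (en[0], zh[0]),
            "abstract": (en[1], zh[1]),
        })
    return pairs
```

Keying the merge on the publication number means that a page missing in one language simply drops out instead of misaligning the remaining pairs.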

Key words: Patent; Bibliographic information; Bilingual parallel corpus; Page parsing
Received: 27 July 2009      Published: 25 September 2009
CLC Number: TP391
Corresponding Authors: Wu Lin     E-mail: suecky@126.com
About the authors: Wu Lin, Wei Xing, Huo Cuiting

Cite this article:

Wu Lin,Wei Xing,Huo Cuiting. Research and Implement of Automatic Patent Bilingual Corpus Extraction from Web——Taking esp@cenet as an Example. New Technology of Library and Information Service, 2009, (9): 57-63.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2009.09.10     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2009/V/I9/57

