New Technology of Library and Information Service  2007, Vol. 2 Issue (11): 58-62    DOI: 10.11925/infotech.1003-3513.2007.11.12
Principle Research and Architecture Design of Focused Crawler Based on WebSPHINX
Bai Guangzu  Lv Junsheng
1(The Lanzhou Branch of the National Science Library, CAS, Lanzhou 730000, China)
2(Graduate University of CAS, Beijing 100049, China)
Abstract  

After reviewing the origin, basic principles, and architecture of focused crawlers, the authors analyse the features of WebSPHINX and then design a focused crawler based on it.

Key words: Focused crawler; WebSPHINX; Architecture
Received: 27 September 2007      Published: 25 November 2007
Classification number: TP391.3
Corresponding author: Bai Guangzu     E-mail: bmw6809@163.com
About the authors: Bai Guangzu, Lv Junsheng

Cite this article:

Bai Guangzu, Lv Junsheng. Principle Research and Architecture Design of Focused Crawler Based on WebSPHINX. New Technology of Library and Information Service, 2007, 2(11): 58-62.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2007.11.12     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2007/V2/I11/58

