Principle Research and Architecture Design of Focused Crawler Based on WebSPHINX
Bai Guangzu, Lv Junsheng
1 (The Lanzhou Branch of the National Science Library, CAS, Lanzhou 730000, China)
2 (Graduate University of CAS, Beijing 100049, China)
Abstract: After discussing the origin, basic principles, and architecture of focused crawlers, the authors analyse the features of WebSPHINX and then design a focused crawler based on it.
Received: 27 September 2007
Published: 25 November 2007
Corresponding Authors:
Bai Guangzu
E-mail: bmw6809@163.com
About authors: Bai Guangzu, Lv Junsheng