After discussing the origin, basic principles, and architecture of focused crawlers, the authors analyse the features of WebSPHINX and then design a focused crawler based on it.
Bai Guangzu, Lv Junsheng. Principle Research and Architecture Design of Focused Crawler Based on WebSPHINX[J]. New Technology of Library and Information Service, 2007, 2(11): 58-62.