Design and Implementation of Focused Crawler Based on OSS
Li Chunwang
(Library of Chinese Academy of Sciences, Beijing 100080, China)
Abstract: After analyzing the architecture of a focused crawler and the strategies for implementing it with open source software (OSS), this paper discusses subject modeling and the related algorithms in detail, and explains the integration technologies used, which include common Java standards, Web services and the Java Native Interface (JNI).
Received: 10 November 2006
Published: 25 January 2007
Corresponding Author:
Li Chunwang
E-mail: licw@mail.las.ac.cn
About author: Li Chunwang