New Technology of Library and Information Service  2009, Vol. 25 Issue (4): 1-6    DOI: 10.11925/infotech.1003-3513.2009.04.01
Targeted Websites Harvest System Based on Nutch
Xu Jian1,2, Zhang Zhixiong1
1(National Science Library, Chinese Academy of Sciences, Beijing 100190, China)
2(Department of Information Management, Sun Yat-Sen University, Guangzhou 510275, China)
Abstract  

The paper analyzes typical open-source Web crawling software, such as Nutch, Heritrix, WCT, and Web-Harvest. Based on this analysis, it proposes a targeted website harvesting system built on Nutch. Four key issues of the system are discussed in detail: selection of the initial seed websites, management of the harvest process, denoising of web page content, and discovery of new seed websites.
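
The abstract does not say how new seed websites are discovered. As an illustration only, and not the authors' method, one simple heuristic is to count how many harvested pages link to each external host and to propose frequently linked hosts as candidate seeds. The sketch below assumes the outlinks of each harvested page are already available as a plain dictionary; the function name `candidate_seed_hosts`, its parameters, and the `min_pages` threshold are hypothetical and are not part of Nutch.

```python
from collections import Counter
from urllib.parse import urlparse


def candidate_seed_hosts(outlinks_by_page, known_seed_hosts, min_pages=2):
    """Rank external hosts by how many harvested pages link to them.

    outlinks_by_page maps a harvested page URL to the list of URLs it links to
    (an assumed input format, not a Nutch data structure). Hosts already in
    known_seed_hosts are skipped, and each host is counted at most once per
    linking page.
    """
    known = {host.lower() for host in known_seed_hosts}
    counts = Counter()
    for page_url, outlinks in outlinks_by_page.items():
        hosts_on_page = set()
        for link in outlinks:
            # Relative links have no hostname and are ignored.
            host = (urlparse(link).hostname or "").lower()
            if host and host not in known:
                hosts_on_page.add(host)
        counts.update(hosts_on_page)
    # Keep hosts linked from at least min_pages pages, most linked first.
    ranked = [(host, n) for host, n in counts.items() if n >= min_pages]
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked
```

In practice the ranked hosts would still need manual or rule-based review before being added to the seed list, since link frequency alone says nothing about topical relevance.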

Key words: Targeted Websites Harvest System; Nutch; Website Crawl; Web Page Denoising
Received: 17 February 2009      Published: 25 April 2009
CLC Number: G250.76
Corresponding Authors: Xu Jian     E-mail: xujian@mail.las.ac.cn
About authors: Xu Jian, Zhang Zhixiong

Cite this article:

Xu Jian, Zhang Zhixiong. Targeted Websites Harvest System Based on Nutch. New Technology of Library and Information Service, 2009, 25(4): 1-6.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2009.04.01     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2009/V25/I4/1
