1. National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2. Department of Information Management, Sun Yat-Sen University, Guangzhou 510275, China
This paper analyzes typical open-source web crawling software, including Nutch, Heritrix, WCT, and Web-Harvest. Based on this analysis, it proposes a targeted website harvesting system built on Nutch. Four key issues of the system are discussed in detail: selecting the initial seed websites, managing the harvest process, denoising web page content, and discovering new seed websites.
徐健,张智雄. 基于Nutch的Web网站定向采集系统*[J]. 现代图书情报技术, 2009, 25(4): 1-6.
Xu Jian, Zhang Zhixiong. Targeted Websites Harvest System Based on Nutch. New Technology of Library and Information Service, 2009, 25(4): 1-6.