Targeted Websites Harvest System Based on Nutch
Xu Jian 1,2, Zhang Zhixiong 1
1 (National Science Library, Chinese Academy of Sciences, Beijing 100190, China)
2 (Department of Information Management, Sun Yat-Sen University, Guangzhou 510275, China)
Abstract This paper analyzes typical open-source Web crawling software, including Nutch, Heritrix, WCT, and Web-Harvest. Based on that analysis, it proposes a targeted website harvesting system built on Nutch. Four key issues of the system are discussed: selecting the initial seed websites, managing the harvest process, denoising web page content, and discovering new seed websites.
Received: 17 February 2009
Published: 25 April 2009
Corresponding Authors:
Xu Jian
E-mail: xujian@mail.las.ac.cn
About authors: Xu Jian, Zhang Zhixiong