New Technology of Library and Information Service (现代图书情报技术)  2009, Vol. 25 Issue (4): 1-6     https://doi.org/10.11925/infotech.1003-3513.2009.04.01
Special Topic
Targeted Websites Harvest System Based on Nutch*
Xu Jian1,2   Zhang Zhixiong1
1(National Science Library, Chinese Academy of Sciences, Beijing 100190, China)
2(Department of Information Management, Sun Yat-Sen University, Guangzhou 510275, China)
Abstract

This paper compares four representative open-source web crawlers: Nutch, Heritrix, WCT (Web Curator Tool), and Web-Harvest. Based on that comparison, it proposes a targeted website harvesting system built on Nutch and examines four key issues in detail: selecting the initial seed websites, managing the harvest process, removing noise from web page content, and discovering new seed websites.

Key words: Targeted Websites Harvest System    Nutch    Website Crawl    Web Page Denoising
Received: 2009-02-17      Published: 2009-04-25
CLC number: G250.76

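Funding:

*This work is one of the research outcomes of "Network Science and Technology Information Monitoring and Evaluation" (project No. 2006BAH03B05), a sub-project of the National Key Technology R&D Program of the 11th Five-Year Plan.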

Corresponding author: Xu Jian     E-mail: xujian@mail.las.ac.cn
About the authors: Xu Jian, Zhang Zhixiong
Cite this article:
Xu Jian, Zhang Zhixiong. Targeted Websites Harvest System Based on Nutch[J]. New Technology of Library and Information Service, 2009, 25(4): 1-6.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2009.04.01      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2009/V25/I4/1

