Targeted Websites Harvest System Based on Nutch*

doi:10.11925/infotech.1003-3513.2009.04.01

New Technology of Library and Information Service

2009, Vol. 25, Issue 4: 1-6. https://doi.org/10.11925/infotech.1003-3513.2009.04.01

Special Topic


Xu Jian¹,², Zhang Zhixiong¹

¹ National Science Library, Chinese Academy of Sciences, Beijing 100190, China
² Department of Information Management, Sun Yat-Sen University, Guangzhou 510275, China

Abstract

After comparing representative open-source web crawling software (Nutch, Heritrix, WCT, and Web-Harvest), this paper proposes a targeted website harvest system based on Nutch. It focuses on four key issues in this system: selecting the initial seed websites, managing the harvest process, denoising web page content, and discovering new seed websites.

Keywords: Targeted Websites Harvest System, Nutch, Website Crawl, Web Page Denoising

Received: 2009-02-17 Published: 2009-04-25

G250.76

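Funding:

*This work is one of the research outcomes of the sub-project "Network Scientific and Technical Information Monitoring and Evaluation" (project No. 2006BAH03B05) of the National Key Technology R&D Program during the 11th Five-Year Plan.

Corresponding author: Xu Jian E-mail: xujian@mail.las.ac.cn

Authors: Xu Jian, Zhang Zhixiong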

Cite this article:

Xu Jian, Zhang Zhixiong. Targeted Websites Harvest System Based on Nutch[J]. New Technology of Library and Information Service, 2009, 25(4): 1-6.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2009.04.01 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2009/V25/I4/1
