The Design and Application of a Web Information Extraction System Based on Web-Harvest

New Technology of Library and Information Service (现代图书情报技术)

2010, Vol. 26, Issue 3: 76-81. https://doi.org/10.11925/infotech.1003-3513.2010.03.13

Section: Application Practice

Zhan Jiajia (詹佳佳)

(Department of Information Management, Sun Yat-sen University, Guangzhou 510006, China)


Abstract

This paper first introduces Web-Harvest, an open-source information extraction tool, in detail. Building on Web-Harvest with functional extensions and improvements, it then designs a general-purpose Web information extraction system. The paper focuses on the system's design ideas and processing flow, and briefly describes the design of its database tables. Finally, it presents an application of the system.

Keywords: Web-Harvest; Web information extraction

Received: 2010-01-28 Published: 2010-03-25

CLC number: G250.76

Corresponding author: Zhan Jiajia E-mail: jiajiazhan2005@126.com

About the author: Zhan Jiajia

Cite this article:

Zhan Jiajia. The Design and Application of a Web Information Extraction System Based on Web-Harvest [J]. New Technology of Library and Information Service, 2010, 26(3): 76-81.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2010.03.13 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2010/V26/I3/76

