This paper first introduces Web-Harvest, an open-source information extraction tool, in detail. By extending and improving its functionality, a Web information extraction system based on Web-Harvest is designed. The paper focuses on the system's design rationale and processing workflow, and briefly describes the design of the database tables. Finally, an application of the system is presented.
Zhan Jiajia. The Design and Application of a Web Information Extraction System Based on Web-Harvest[J]. New Technology of Library and Information Service, 2010, 26(3): 76-81.