New Technology of Library and Information Service  2010, Vol. 26 Issue (3): 19-26    DOI: 10.11925/infotech.1003-3513.2010.03.04
Research and Implementation of Nutch-based Website Harvest and Service System in Special Field
Chang Zhirong (1), Ma Ziwei (2), Li Gaohu (3)
(1) College of Computer, Beijing University of Posts and Telecommunications, Beijing 100876, China
(2) Beijing University of Posts and Telecommunications Library, Beijing 100876, China
(3) BUPT Assets Management Co., Ltd., Beijing 100876, China
Abstract  

This paper proposes the design of a Nutch-based website harvest and service system for a specialized field, built within a framework for integrating digital library systems. To improve the system's functionality and performance, it introduces an information filtering module, a dictionary-based Chinese analyzer module, a GUI information module, a topic-knowledge-based information processing module, and a Webservice-based search service module. The paper focuses on text parsing filters, plugin development, and the application of hierarchical automatic clustering to search results. Finally, the system is integrated with the other subsystems of the digital library through a Webservice interface, allowing it to provide comprehensive, professional services.

Key words: Nutch; Website harvest; Chinese analyzer plugin; Webservice; Integration services
Received: 05 March 2010      Published: 25 March 2010
CLC number: G250
Corresponding Author: Chang Zhirong     E-mail: changzhirong6@gmail.com
About authors: Chang Zhirong, Ma Ziwei, Li Gaohu

Cite this article:

Chang Zhirong, Ma Ziwei, Li Gaohu. Research and Implementation of Nutch-based Website Harvest and Service System in Special Field. New Technology of Library and Information Service, 2010, 26(3): 19-26.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2010.03.04     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2010/V26/I3/19

