Research and Implementation of Nutch-based Website Harvest and Service System in Special Field
Chang Zhirong1, Ma Ziwei2, Li Gaohu3
1(College of Computer, Beijing University of Posts and Telecommunications, Beijing 100876, China)
2(Beijing University of Posts and Telecommunications Library, Beijing 100876, China)
3(BUPT Assets Management Co., Ltd., Beijing 100876, China)
Abstract This paper proposes the design of a Nutch-based website harvesting and service system for a special field, built within the framework of digital library systems integration. To improve the system's function and performance, it introduces an information filtering module, a dictionary-based Chinese analyzer module, a GUI information module, a topic-knowledge-based information processing module, and Webservice-based search service modules. The paper focuses on text parsing filters, plugin development, and the application of hierarchical automatic clustering to search results. Finally, the system is integrated with the other subsystems of the digital library through a Webservice interface, allowing it to provide comprehensive and professional services.
Received: 05 March 2010
Published: 25 March 2010
Corresponding author: Chang Zhirong1
E-mail: changzhirong6@gmail.com
About the authors: Chang Zhirong, Ma Ziwei, Li Gaohu