Design and Implementation of a Topic-Focused Search Engine Based on OSS

doi:10.11925/infotech.1003-3513.2007.01.12

New Technology of Library and Information Service (现代图书情报技术)

2007, Vol. 2, Issue (1): 49-52. https://doi.org/10.11925/infotech.1003-3513.2007.01.12

Special Topic Research


Design and Implementation of Focused Crawler Based on OSS

Li Chunwang (李春旺)

(Library of Chinese Academy of Sciences, Beijing 100080, China)


Abstract

After analyzing the architecture of a topic-focused search engine (focused crawler), this paper proposes an implementation strategy based on open source software (OSS). It focuses on topic modeling methods, topic relevance algorithms, and techniques for integrating open source systems through shared code conventions, Web Service interface specifications, and Java Native Interface (JNI) specifications.

Keywords: focused crawler, search engine, open source software (OSS), system design and implementation
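The abstract names a topic relevance algorithm without giving its details. As a rough illustration of what such a score can look like, the sketch below computes the cosine similarity between a weighted topic vector and the term counts of a page. This is an assumption chosen for illustration, not the method described in the paper: the `TOPIC` terms and weights, the `threshold` value, and the function names are all hypothetical, and the tokenizer only handles space-delimited alphanumeric text (Chinese pages would need a word segmenter first).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def topic_relevance(topic_weights, page_text):
    """Cosine similarity between a weighted topic vector and a page's term counts."""
    counts = Counter(tokenize(page_text))
    # Only terms present in the topic model contribute to the dot product.
    dot = sum(w * counts.get(term, 0) for term, w in topic_weights.items())
    topic_norm = math.sqrt(sum(w * w for w in topic_weights.values()))
    page_norm = math.sqrt(sum(c * c for c in counts.values()))
    if topic_norm == 0 or page_norm == 0:
        return 0.0  # empty topic or empty page: treat as not relevant
    return dot / (topic_norm * page_norm)


# Hypothetical topic model: hand-picked terms and weights, for illustration only.
TOPIC = {"crawler": 1.0, "search": 0.8, "index": 0.5}


def should_follow(page_text, threshold=0.3):
    """Decide whether a page is on-topic enough to keep (threshold is an assumption)."""
    return topic_relevance(TOPIC, page_text) >= threshold
```

A crawler built along these lines would call `should_follow` on each fetched page and only queue the outgoing links of pages that pass. The score stays between 0 and 1 as long as the topic weights are non-negative, which makes a fixed threshold easy to reason about.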

Received: 2006-11-10; Published: 2007-01-25

TP39

Corresponding author: Li Chunwang (李春旺), E-mail: licw@mail.las.ac.cn

About the author: Li Chunwang (李春旺)

Cite this article:

Li Chunwang. Design and Implementation of Focused Crawler Based on OSS[J]. New Technology of Library and Information Service (现代图书情报技术), 2007, 2(1): 49-52.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2007.01.12 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2007/V2/I1/49


