An Improved Best-First Search Algorithm Based Focused Crawling Research

doi:10.11925/infotech.1003-3513.2013.07-08.04

New Technology of Library and Information Service

2013, Vol. 29

Issue (7/8): 28-35 https://doi.org/10.11925/infotech.1003-3513.2013.07-08.04

Digital Library

Qiao Jianzhong (乔建忠)

Information Management Center of PLA Academy of Arts, Beijing 100081, China

Abstract: By refining and reclassifying the characteristic factors that focused crawlers use to predict the priority of Web links, this paper introduces two new features, harvest rate and media type, as a basis for judging relevance, and proposes an improved Best-First Search (BFS) algorithm. The algorithm applies a "fine-grained" strategy to filter out irrelevant Web pages and builds a link priority formula from representative characteristic factors chosen from several perspectives, so that the topic of each link can be characterized and predicted comprehensively. A small-scale experiment comparing it with three other focused search algorithms shows that the improved algorithm performs better on harvest rate and on the average number of links submitted.

Key words: Focused crawling, Search algorithm, Best-First Search algorithm, Focused crawler, Characteristic factor
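
The general idea summarized in the abstract, a best-first frontier ordered by a link priority that combines several feature factors, can be sketched as follows. This is only an illustrative sketch, not the paper's method: the feature names (`anchor`, `parent`, `harvest`, `media`), the weights, the weighted-sum form of `link_priority`, and the toy link graph are all assumptions, since the actual formula and the "fine-grained" filtering step are not given on this page. Harvest rate is computed here in its usual sense, as the fraction of crawled pages that are relevant.

```python
import heapq

# Hypothetical weights; the paper's actual priority formula is not reproduced here.
WEIGHTS = {"anchor": 0.4, "parent": 0.3, "harvest": 0.2, "media": 0.1}


def link_priority(features):
    """Weighted sum of feature scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)


def best_first_crawl(seeds, get_links, is_relevant, max_pages):
    """Visit pages in descending priority order.

    get_links(url)   -> iterable of (child_url, features) pairs
    is_relevant(url) -> bool
    Returns (visit_order, harvest_rate).
    """
    # heapq is a min-heap, so priorities are negated; the counter breaks ties.
    frontier = [(-1.0, i, url) for i, url in enumerate(seeds)]
    heapq.heapify(frontier)
    counter = len(frontier)
    seen = set(seeds)
    visited, relevant = [], 0

    while frontier and len(visited) < max_pages:
        _, _, url = heapq.heappop(frontier)
        visited.append(url)
        if is_relevant(url):
            relevant += 1
        for child, features in get_links(url):
            if child not in seen:
                seen.add(child)
                heapq.heappush(frontier, (-link_priority(features), counter, child))
                counter += 1

    harvest_rate = relevant / len(visited) if visited else 0.0
    return visited, harvest_rate


# Toy in-memory "Web" (hypothetical data); no network access is involved.
TOY_WEB = {
    "seed": [
        ("a", {"anchor": 0.9, "parent": 0.8, "harvest": 0.5, "media": 1.0}),
        ("b", {"anchor": 0.1, "parent": 0.8, "harvest": 0.5, "media": 0.0}),
    ],
    "a": [("c", {"anchor": 0.7, "parent": 0.9, "harvest": 0.6, "media": 1.0})],
    "b": [],
    "c": [],
}
RELEVANT = {"seed", "a", "c"}

order, rate = best_first_crawl(
    ["seed"],
    lambda url: TOY_WEB.get(url, []),
    lambda url: url in RELEVANT,
    max_pages=3,
)
print(order, rate)
```

With a budget of three pages, the low-scoring link `b` is never fetched, so every crawled page in this toy run is relevant. Raising `max_pages` to 4 forces the crawler to fetch `b` as well, which lowers the harvest rate.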

Received: 2013-04-26 Published: 2013-09-02

G250.73

Corresponding author: Qiao Jianzhong E-mail: qiaojianzhong@mail.las.ac.cn

Cite this article:

Qiao Jianzhong. An Improved Best-First Search Algorithm Based Focused Crawling Research[J]. New Technology of Library and Information Service, 2013, 29(7/8): 28-35.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2013.07-08.04 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2013/V29/I7/8/28
