New Technology of Library and Information Service  2013, Vol. 29 Issue (7/8): 28-35    DOI: 10.11925/infotech.1003-3513.2013.07-08.04
An Improved Best-First Search Algorithm Based Focused Crawling Research
Qiao Jianzhong
Information Management Center of PLA Academy of Arts, Beijing 100081, China
Abstract  This paper refines and reclassifies the characteristic factors that focused crawlers use to predict the priority of Web links, and introduces two new features, harvest rate and media type, as bases for judging relevance. Building on these, it proposes an improved Best-First Search algorithm. The algorithm applies a "fine-grained" policy to filter out irrelevant Web pages, selects representative characteristic factors from multiple angles, and constructs a link priority formula to reveal and predict the topics of Web links comprehensively. A small-scale experiment comparing it with three other topic search algorithms shows that the improved algorithm performs better on harvest rate and on the average number of links submitted.
Key words: Focused crawling; Search algorithm; Best-First Search algorithm; Focused crawler; Characteristic factor
Received: 26 April 2013      Published: 02 September 2013
CLC Number: G250.73

Cite this article:

Qiao Jianzhong. An Improved Best-First Search Algorithm Based Focused Crawling Research. New Technology of Library and Information Service, 2013, 29(7/8): 28-35.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2013.07-08.04     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2013/V29/I7/8/28
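
As a rough illustration of the idea summarized in the abstract, the Python sketch below runs a best-first crawl over a tiny in-memory link graph and reports the harvest rate, taken here as the share of visited pages that are relevant. It is not the paper's algorithm: the factor names (`anchor_text`, `parent_relevance`, `harvest_rate`), the weights, the weighted-sum form of `link_priority`, the `fetch` callback and the toy graph are all assumptions made for this example, since the abstract does not give the actual link priority formula.

```python
import heapq
from itertools import count

# Hypothetical factors and weights, chosen only for illustration.
WEIGHTS = {"anchor_text": 0.5, "parent_relevance": 0.3, "harvest_rate": 0.2}


def link_priority(factors):
    """Weighted sum of factor scores in [0, 1]; missing factors count as 0."""
    return sum(w * factors.get(name, 0.0) for name, w in WEIGHTS.items())


def best_first_crawl(seeds, fetch, max_pages):
    """Visit pages in descending link priority.

    fetch(url) must return (is_relevant, [(out_url, factors), ...]).
    Returns (visit_order, harvest_rate).
    """
    tie = count()  # equal-priority links are popped in insertion order
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, next(tie), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited, relevant = [], 0

    while frontier and len(visited) < max_pages:
        _, _, url = heapq.heappop(frontier)
        is_relevant, links = fetch(url)
        visited.append(url)
        relevant += int(is_relevant)
        for out_url, factors in links:
            if out_url not in seen:
                seen.add(out_url)
                entry = (-link_priority(factors), next(tie), out_url)
                heapq.heappush(frontier, entry)

    harvest_rate = relevant / len(visited) if visited else 0.0
    return visited, harvest_rate


# Toy graph standing in for the Web: url -> (is_relevant, outgoing links).
TOY_WEB = {
    "seed": (True, [("a", {"anchor_text": 0.9, "parent_relevance": 1.0}),
                    ("b", {"anchor_text": 0.1, "parent_relevance": 1.0})]),
    "a": (True, [("c", {"anchor_text": 0.8, "parent_relevance": 1.0,
                        "harvest_rate": 1.0})]),
    "b": (False, []),
    "c": (True, []),
}


def fetch_toy(url):
    return TOY_WEB[url]


order, rate = best_first_crawl(["seed"], fetch_toy, max_pages=3)
print(order, rate)
```

With a budget of three pages the low-priority link `b` is never fetched, which is the behaviour a best-first frontier is meant to produce; a real crawler would replace `fetch_toy` with page download, relevance judgment and factor extraction.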
