Abstract: This paper refines and reclassifies the feature factors that focused crawlers use to predict the priority of Web links, introduces two new features, harvest rate and media type, as a basis for judging relevance, and proposes an improved Best-First Search algorithm. The algorithm applies a "fine-grained" policy to filter out irrelevant Web pages, selects representative feature factors from multiple angles, and constructs a link priority formula to characterize and predict the topics of Web links comprehensively. A small-scale experiment comparing it with three other topic search algorithms shows that the improved algorithm performs better on harvest rate and on the average number of links submitted.
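To make the idea in the abstract concrete, the following is a minimal sketch of a best-first crawl frontier that ranks links by a weighted feature score and filters them by media type. It is not the paper's formula: the abstract gives neither the features' exact definitions nor their weights, so the names `WEIGHTS`, `ALLOWED_MEDIA`, `link_priority`, and `Frontier`, the three chosen features, and the weight values are all illustrative assumptions. Harvest rate is taken here in its usual sense, the fraction of fetched pages judged relevant.

```python
import heapq
from itertools import count

# Hypothetical weights; the abstract does not state the actual formula.
WEIGHTS = {"anchor": 0.5, "parent": 0.3, "harvest": 0.2}
# Hypothetical set of media types considered worth fetching.
ALLOWED_MEDIA = {"text/html"}


def harvest_rate(relevant_fetched, total_fetched):
    """Fraction of fetched pages judged relevant (0.0 if nothing was fetched)."""
    return relevant_fetched / total_fetched if total_fetched else 0.0


def link_priority(anchor_score, parent_score, site_harvest):
    """Weighted sum of feature scores, each assumed to lie in [0, 1]."""
    return (WEIGHTS["anchor"] * anchor_score
            + WEIGHTS["parent"] * parent_score
            + WEIGHTS["harvest"] * site_harvest)


class Frontier:
    """Best-first frontier: pop() returns the highest-priority link first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._tie = count()  # preserves insertion order among equal priorities

    def push(self, url, priority, media_type="text/html"):
        # Filter step: skip duplicates and links with a disallowed media type.
        if url in self._seen or media_type not in ALLOWED_MEDIA:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the priority is stored negated.
        heapq.heappush(self._heap, (-priority, next(self._tie), url))
        return True

    def pop(self):
        neg_priority, _, url = heapq.heappop(self._heap)
        return url, -neg_priority

    def __len__(self):
        return len(self._heap)


# Small demonstration with made-up scores and placeholder URLs.
frontier = Frontier()
frontier.push("http://example.org/a", link_priority(0.9, 0.5, harvest_rate(3, 4)))
frontier.push("http://example.org/b", link_priority(0.2, 0.5, harvest_rate(1, 4)))
frontier.push("http://example.org/c.mp4", 1.0, media_type="video/mp4")  # filtered out
best_url, best_score = frontier.pop()
print(best_url)
```

In this sketch the media type acts as a hard filter while harvest rate contributes to the score; the paper may combine the two features differently.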
乔建忠. 一种基于改进BFS算法的主题搜索技术研究[J]. 现代图书情报技术, 2013, 29(7/8): 28-35.
Qiao Jianzhong. An Improved Best-First Search Algorithm Based Focused Crawling Research. New Technology of Library and Information Service, 2013, 29(7/8): 28-35.