Anchor and Link Text Expansion Based KBES Algorithm Tunneling Strategy
Qiao Jianzhong
National Science Library, Chinese Academy of Sciences, Beijing 100190, China; Educational Technology Center of PLA Academy of Arts, Beijing 100081, China; Graduate University of Chinese Academy of Sciences, Beijing 100049, China
Abstract: Building on a summary of “true tunnel” and “false tunnel” strategies for focused crawlers, this paper proposes a new KBES algorithm to solve the “false tunnel” problem. Experiments show that the KBES algorithm can, to some extent, improve the efficiency with which heuristic strategies predict the relevance of new links from anchor and link text.
乔建忠. 基于锚与链接文本扩展的KBES算法隧道策略[J]. 现代图书情报技术, 2011, 27(3): 45-50.
Qiao Jianzhong. Anchor and Link Text Expansion Based KBES Algorithm Tunneling Strategy. New Technology of Library and Information Service, 2011, 27(3): 45-50.