Anchor and Link Text Expansion Based KBES Algorithm Tunneling Strategy

doi:10.11925/infotech.1003-3513.2011.03.07

New Technology of Library and Information Service (现代图书情报技术)

2011, Vol. 27, Issue 3: 45-50

Knowledge Organization and Knowledge Management


Qiao Jianzhong (乔建忠)

National Science Library, Chinese Academy of Sciences, Beijing 100190, China; Educational Technology Center of PLA Academy of Arts, Beijing 100081, China; Graduate University of Chinese Academy of Sciences, Beijing 100049, China


Abstract: Building on a summary of the "true tunnel" and "false tunnel" strategies used by focused crawlers, this paper proposes a KBES algorithm to address the "false tunnel" problem. Experimental analysis shows that the KBES algorithm can, to some extent, improve the efficiency with which anchor and link text predict the relevance of new links in heuristic strategies.

Key words: Focused crawling, Tunneling, Search algorithm, Focused crawler

Received: 2011-02-15 Published: 2011-05-05

G250.73

Cite this article:

Qiao Jianzhong. Anchor and Link Text Expansion Based KBES Algorithm Tunneling Strategy[J]. New Technology of Library and Information Service, 2011, 27(3): 45-50.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2011.03.07 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2011/V27/I3/45


