Statistical Characteristics Based Web Page Relevance Judgment Strategy for the “Type” Topics Crawled

doi:10.11925/infotech.1003-3513.2012.06.02

New Technology of Library and Information Service (现代图书情报技术)

2012, Vol. 28, Issue 6: 9-16

Digital Library


Qiao Jianzhong

Information Management Center of PLA Academy of Arts, Beijing 100081, China


Abstract: To meet the lightweight design requirements of online classification in a focused crawler, this paper proposes a strategy that judges the topical relevance of Web pages by type, based on several statistical characteristics that represent Web document types. Using the API provided by WEKA, the paper devises a training algorithm and a classification algorithm for this relevance judgment strategy. Experiments on classification accuracy, efficiency, and attribute selection demonstrate the validity of the strategy and identify five Web page statistical characteristics that play a key role in type identification.

Key words: relevance judgment strategy, focused crawler, focused crawling, digital library
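The abstract describes representing each Web page by a small set of statistical characteristics and passing that representation to a trained classifier. The sketch below illustrates only the first step, under stated assumptions: the abstract does not list the paper's characteristics, so the five counts used here (`tag_count`, `link_count`, `image_count`, `text_chars`, `anchor_text_ratio`) and the names `PageStats` and `extract_features` are hypothetical stand-ins, and Python's standard `html.parser` is used in place of the paper's Java and WEKA toolchain.

```python
from html.parser import HTMLParser


class PageStats(HTMLParser):
    """Collects a few illustrative page-level counts while parsing HTML."""

    def __init__(self):
        super().__init__()
        self.tag_count = 0      # number of start tags seen
        self.link_count = 0     # number of <a> elements
        self.image_count = 0    # number of <img> elements
        self.text_chars = 0     # non-whitespace characters of text
        self.anchor_chars = 0   # non-whitespace characters inside <a>
        self._anchor_depth = 0  # > 0 while inside an open <a> element

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1
        if tag == "a":
            self.link_count += 1
            self._anchor_depth += 1
        elif tag == "img":
            self.image_count += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1

    def handle_data(self, data):
        # Count only non-whitespace characters so layout whitespace is ignored.
        n = sum(1 for ch in data if not ch.isspace())
        self.text_chars += n
        if self._anchor_depth:
            self.anchor_chars += n


def extract_features(html):
    """Return a dict of hypothetical statistical features for one page."""
    parser = PageStats()
    parser.feed(html)
    parser.close()
    text = parser.text_chars
    return {
        "tag_count": parser.tag_count,
        "link_count": parser.link_count,
        "image_count": parser.image_count,
        "text_chars": text,
        # Share of visible text that sits inside links; 0.0 for empty pages.
        "anchor_text_ratio": parser.anchor_chars / text if text else 0.0,
    }


if __name__ == "__main__":
    sample = '<html><body><p>Hello world</p><a href="x">link</a></body></html>'
    print(extract_features(sample))
```

A feature dictionary of this kind is cheap to compute during crawling, which is in the spirit of the lightweight requirement the abstract states. In a full pipeline, such vectors would be collected for labeled pages and handed to a classifier for training; the paper does this through the WEKA API, which is not reproduced here.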

Received: 2012-05-16    Published: 2012-08-30

CLC number: G250.73

Cite this article:

Qiao Jianzhong. Statistical Characteristics Based Web Page Relevance Judgment Strategy for the “Type” Topics Crawled[J]. New Technology of Library and Information Service, 2012, 28(6): 9-16.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2012.06.02 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2012/V28/I6/9


