Abstract: This paper proposes a new relevance judgment strategy for Web page types, based on several statistical characteristics of Web document types, to meet a focused crawler's requirement for lightweight online classification. Using the API provided by WEKA, the paper devises training and classification algorithms for the strategy. Experiments on classification accuracy, efficiency, and attribute selection demonstrate the validity of the strategy and identify five Web page statistical characteristics that play a key role in type identification.
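As an illustration of what "statistical characteristics" of a page might look like in practice, the following is a minimal Python sketch that turns raw HTML into a small numeric feature vector. The abstract does not name the five characteristics, so the features computed here (link count, image count, visible text length, anchor-text ratio) are hypothetical stand-ins, as are the names `PageStats` and `extract_features`; the paper itself uses the WEKA Java API for training and classification, which this sketch does not reproduce.

```python
from html.parser import HTMLParser


class PageStats(HTMLParser):
    """Collects a few page-level counts (hypothetical features, not the paper's)."""

    def __init__(self):
        super().__init__()
        self.link_count = 0
        self.image_count = 0
        self.text_chars = 0         # visible text length, whitespace-trimmed per chunk
        self.anchor_text_chars = 0  # portion of visible text inside <a> elements
        self._anchor_depth = 0
        self._skip_depth = 0        # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_count += 1
            self._anchor_depth += 1
        elif tag == "img":
            self.image_count += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and style content
        n = len(data.strip())
        self.text_chars += n
        if self._anchor_depth:
            self.anchor_text_chars += n


def extract_features(html):
    """Return a dict of illustrative statistical features for one HTML page."""
    parser = PageStats()
    parser.feed(html)
    parser.close()
    ratio = parser.anchor_text_chars / parser.text_chars if parser.text_chars else 0.0
    return {
        "link_count": parser.link_count,
        "image_count": parser.image_count,
        "text_chars": parser.text_chars,
        "anchor_text_ratio": ratio,
    }


if __name__ == "__main__":
    sample = '<html><body><p>Hello</p><a href="x">link</a><img src="a.png"></body></html>'
    print(extract_features(sample))
```

A vector like this could then be passed to any lightweight classifier. Counts and ratios are cheap to compute in a single pass over the page, which is in the spirit of the lightweight online classification the abstract calls for.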
乔建忠. 一种基于统计特征面向“类型”主题抓取的网页相关性判断策略研究[J]. 现代图书情报技术, 2012, 28(6): 9-16.
Qiao Jianzhong. Statistical Characteristics Based Web Page Relevance Judgment Strategy for the “Type” Topics Crawled. New Technology of Library and Information Service, 2012, 28(6): 9-16.