New Technology of Library and Information Service  2012, Vol. 28 Issue (6): 9-16    DOI: 10.11925/infotech.1003-3513.2012.06.02
Statistical Characteristics Based Web Page Relevance Judgment Strategy for the “Type” Topics Crawled
Qiao Jianzhong
Information Management Center of PLA Academy of Arts, Beijing 100081, China
Abstract  This paper proposes a Web page type relevance judgment strategy based on several statistical characteristics of Web document types, designed to meet the lightweight online classification requirements of a focused crawler. Using the API provided by WEKA, the paper devises training and classification algorithms for the strategy. Experiments on classification accuracy, efficiency, and attribute selection demonstrate the validity of the strategy and show that five Web page statistical characteristics play a key role in type identification.
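As a rough illustration of the kind of lightweight, statistics-based type judgment the abstract describes, the following minimal sketch trains a Gaussian naive Bayes classifier on numeric page statistics and uses it to decide whether a page matches a target type. It is not the paper's implementation and does not call the WEKA API. The class name, the method names, and the three features (link count, text length, link-text ratio) are hypothetical, since the abstract does not name the five characteristics it relies on.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative Gaussian naive Bayes over numeric page statistics.
 * Hypothetical sketch: the paper builds its classifier with WEKA;
 * this class only mirrors the general train/classify workflow.
 */
class PageTypeClassifier {
    private final int numFeatures;
    private final Map<String, double[]> sums = new HashMap<>();    // per-class sum of each feature
    private final Map<String, double[]> sqSums = new HashMap<>();  // per-class sum of squares
    private final Map<String, Integer> counts = new HashMap<>();   // per-class sample count
    private int total = 0;

    PageTypeClassifier(int numFeatures) {
        this.numFeatures = numFeatures;
    }

    /** Adds one labeled page (a vector of statistics) to the model. */
    void train(String label, double[] features) {
        double[] s = sums.computeIfAbsent(label, k -> new double[numFeatures]);
        double[] q = sqSums.computeIfAbsent(label, k -> new double[numFeatures]);
        for (int i = 0; i < numFeatures; i++) {
            s[i] += features[i];
            q[i] += features[i] * features[i];
        }
        counts.merge(label, 1, Integer::sum);
        total++;
    }

    /** Returns the label with the highest log posterior, or null if untrained. */
    String classify(double[] features) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String label = e.getKey();
            int n = e.getValue();
            double score = Math.log((double) n / total); // log prior
            for (int i = 0; i < numFeatures; i++) {
                double mean = sums.get(label)[i] / n;
                // Floor the variance so constant features do not divide by zero.
                double var = Math.max(sqSums.get(label)[i] / n - mean * mean, 1e-6);
                double d = features[i] - mean;
                score += -0.5 * Math.log(2 * Math.PI * var) - d * d / (2 * var);
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    /** Relevance judgment: is the page predicted to be of the target type? */
    boolean isRelevant(double[] features, String targetType) {
        return targetType.equals(classify(features));
    }

    public static void main(String[] args) {
        // Hypothetical features: {link count, text length, link-text ratio}.
        PageTypeClassifier c = new PageTypeClassifier(3);
        c.train("list", new double[] {120, 300, 0.8});
        c.train("list", new double[] {150, 250, 0.9});
        c.train("content", new double[] {20, 3000, 0.1});
        c.train("content", new double[] {15, 4500, 0.05});
        System.out.println(c.classify(new double[] {110, 350, 0.75}));
    }
}
```

Because the model stores only per-class sums, training is a single pass and classifying a page costs a handful of arithmetic operations per feature, which is the sort of budget an online relevance check inside a crawler would need.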
Key words: Relevance judgment strategy; Focused crawler; Focused crawling; Digital library
Received: 16 May 2012      Published: 30 August 2012
Classification number: G250.73

Cite this article:

Qiao Jianzhong. Statistical Characteristics Based Web Page Relevance Judgment Strategy for the “Type” Topics Crawled. New Technology of Library and Information Service, 2012, 28(6): 9-16.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2012.06.02     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2012/V28/I6/9
