A Study on Hub Page Recognition Using URL Features
Ce Zhang1, Yuncheng Du1,2, Ran Liang2
1 Open Laboratory of TRS Software, Beijing Information Science and Technology University, Beijing 100085, China; 2 Beijing TRS Information Technology Co. Ltd., Beijing 100101, China
[Objective] To address the low efficiency of traditional hub page recognition methods by building a simple data sample. [Methods] The method uses URL features as the basis for recognition and applies a Support Vector Machine (SVM) to classify the page type. [Results] The method achieves a precision of 91.2% and improves efficiency by nearly 60%. [Limitations] When a page's URL features are weak, or even contradict its actual type, recognition accuracy drops sharply. [Conclusions] The experimental results show that the method has a clear efficiency advantage and can improve the efficiency of the collection system.
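As a rough illustration of what recognition based on URL features could look like, the sketch below turns a URL into a small feature dictionary that an SVM classifier could consume. The abstract does not list the features the authors used, so every feature here (`path_depth`, `ends_with_slash`, `has_query`, `digit_count`, `last_segment_is_index`) and the function name `url_features` are assumptions chosen for illustration, not the paper's feature set.

```python
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Extract simple, hypothetical URL features for hub page recognition.

    These features are illustrative assumptions; the paper's actual
    feature set is not given in the abstract.
    """
    parsed = urlparse(url)
    path = parsed.path
    # Non-empty path segments, e.g. "/news/2016/" -> ["news", "2016"]
    segments = [s for s in path.split("/") if s]
    last = segments[-1].lower() if segments else ""
    return {
        # Hub (listing) pages often sit closer to the site root.
        "path_depth": len(segments),
        # A directory-style URL tends to suggest a listing page.
        "ends_with_slash": path == "" or path.endswith("/"),
        "has_query": bool(parsed.query),
        # Content pages frequently embed dates or article IDs.
        "digit_count": sum(ch.isdigit() for ch in path),
        # File names such as index.html or list_2.html hint at a hub page.
        "last_segment_is_index": last.startswith(("index", "default", "list")),
    }


if __name__ == "__main__":
    for u in (
        "http://example.com/news/",
        "http://example.com/news/2016/0125/t20160125_12345.html",
    ):
        print(u, url_features(u))
```

In a full pipeline, the resulting dictionaries would be converted to numeric vectors and passed to an SVM trained on labelled hub and content pages. Because only the URL string is inspected, no page download or content parsing is needed, which is consistent with the efficiency gain the abstract reports.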
Ce Zhang, Yuncheng Du, Ran Liang. A Study on Hub Page Recognition Using URL Features[J]. New Technology of Library and Information Service (现代图书情报技术), 2016, 32(1): 24-31.