A Study on Hub Page Recognition Using URL Features
Ce Zhang1, Yuncheng Du1,2, Ran Liang2
1Open Laboratory of TRS Software, Beijing Information Science and Technology University, Beijing 100085, China
2Beijing TRS Information Technology Co. Ltd., Beijing 100101, China
Abstract [Objective] This study addresses the low efficiency of traditional hub page recognition methods by building a simple data sample. [Methods] The method takes URL features as the basis for recognition and uses a Support Vector Machine (SVM) to classify the page type. [Results] The method achieves a precision of 91.2% and improves efficiency by nearly 60%. [Limitations] When URL features are not distinctive, or even contradict the actual page type, recognition accuracy drops sharply. [Conclusions] The experimental results show that the method has a clear efficiency advantage and can improve the efficiency of the collection system.
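To make the idea in [Methods] concrete, the following is a minimal sketch of classifying pages from URL features alone with a linear SVM. It is not the authors' implementation: the abstract does not list the features used, so the four features in `url_features`, the `example.com` URLs, and the hyperparameters are all hypothetical. To stay self-contained, the SVM is a small from-scratch linear model trained by batch subgradient descent on the regularized hinge loss, rather than an SVM library.

```python
import re
from urllib.parse import urlparse


def url_features(url):
    """Map a URL to a small numeric vector (hypothetical feature set)."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    last = segments[-1] if segments else ""
    return [
        len(segments) / 5.0,                               # scaled path depth
        1.0 if path == "" or path.endswith("/") else 0.0,  # directory-style URL
        1.0 if re.search(r"\d{4,}", last) else 0.0,        # long numeric ID in last segment
        1.0 if re.search(r"\.s?html?$", last) else 0.0,    # static page extension
    ]


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent on lam/2*||w||^2 + mean hinge loss."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # only margin violators contribute to the hinge subgradient
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, url):
    """Return 'hub' for a non-negative decision score, else 'detail'."""
    score = sum(wj * xj for wj, xj in zip(w, url_features(url))) + b
    return "hub" if score >= 0 else "detail"


# Toy training sample (hypothetical URLs): +1 = hub page, -1 = detail page.
samples = [
    ("http://example.com/", 1),
    ("http://example.com/news/", 1),
    ("http://example.com/sports/nba/", 1),
    ("http://example.com/news/2015/0625/1234567.html", -1),
    ("http://example.com/sports/nba/20150625/98765.html", -1),
    ("http://example.com/tech/view/445566.shtml", -1),
]
X = [url_features(u) for u, _ in samples]
y = [label for _, label in samples]
w, b = train_linear_svm(X, y)

print(predict(w, b, "http://example.com/finance/"))
print(predict(w, b, "http://example.com/finance/2016/0204/556677.html"))
```

Because classification needs only the URL string, no page has to be downloaded or parsed before it is labeled, which is where an efficiency gain of the kind reported in [Results] would come from. The sketch also shows the weakness named in [Limitations]: a hub page whose URL looks like a detail page, for example one ending in a numeric `.html` file name, would be misclassified by these features.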
Received: 25 June 2015
Published: 04 February 2016