New Technology of Library and Information Service  2016, Vol. 32 Issue (1): 24-31    DOI: 10.11925/infotech.1003-3513.2016.01.05
Original Article
A Study on Hub Page Recognition Using URL Features
Ce Zhang1, Yuncheng Du1,2, Ran Liang2
1Open Laboratory of TRS Software, Beijing Information Science and Technology University, Beijing 100085, China
2Beijing TRS Information Technology Co. Ltd., Beijing 100101, China
Abstract  

[Objective] To address the low efficiency of traditional hub page recognition methods by building a simple data sample. [Methods] The method uses URL features as the basis for recognition and a Support Vector Machine (SVM) to classify the page type. [Results] The method achieves a precision of 91.2% and improves efficiency by nearly 60%. [Limitations] When the URL features are not obvious, or are even completely contrary to the page type, recognition accuracy drops sharply. [Conclusions] The experimental results show that the method has a clear advantage in efficiency and can improve the efficiency of the collection system.
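
As a rough illustration of what "URL features" could look like in practice, the sketch below turns a URL into a small numeric vector. The abstract does not list the features the authors actually used, so every feature here (path depth, digit count, directory-like ending, and so on) and the function name `url_features` are assumptions chosen for illustration only. Vectors of this kind could then be passed to an SVM classifier for training and prediction.

```python
import re
from urllib.parse import urlparse

# File extensions treated as "content-like" endings; an illustrative
# assumption, not a list taken from the paper.
FILE_LIKE = re.compile(r"\.(s?html?|php|aspx?|jsp)$")


def url_features(url: str) -> list:
    """Map a URL to a numeric feature vector (hypothetical feature set)."""
    parsed = urlparse(url)
    path = parsed.path
    segments = [s for s in path.split("/") if s]
    last = segments[-1] if segments else ""
    return [
        float(len(segments)),                             # path depth
        float(len(path)),                                 # path length in characters
        float(sum(ch.isdigit() for ch in path)),          # digits in the path
        1.0 if (path.endswith("/") or not path) else 0.0, # directory-like path
        1.0 if FILE_LIKE.search(last) else 0.0,           # file-like last segment
        1.0 if parsed.query else 0.0,                     # has a query string
    ]


if __name__ == "__main__":
    # A listing-style URL versus an article-style URL (both made up).
    print(url_features("http://example.com/news/"))
    print(url_features("http://example.com/news/2015/0625/12345.html"))
```

Under these assumed features, short directory-like URLs and long, digit-heavy, file-like URLs land in different regions of the feature space, which is the kind of separation a classifier needs; because only the URL string is inspected, no page download is required at feature-extraction time.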

Key words: URL features; Hub pages; SVM
Received: 25 June 2015      Published: 04 February 2016

Cite this article:

Ce Zhang, Yuncheng Du, Ran Liang. A Study on Hub Page Recognition Using URL Features. New Technology of Library and Information Service, 2016, 32(1): 24-31.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2016.01.05     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2016/V32/I1/24
