Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (4): 71-80    DOI: 10.11925/infotech.2096-3467.2017.1188
Original Article
Identifying Malicious Websites with PCA and Random Forest Methods
Yuan Chen, Chaoqun Wang, Zhongyi Hu, Jiang Wu
School of Information Management, Wuhan University, Wuhan 430072, China
The Center for Electronic Commerce Research and Development, Wuhan University, Wuhan 430072, China

[Objective] This study aims to assess and identify malicious websites using multi-source evaluation metrics. [Methods] We used principal component analysis (PCA) to conduct a multi-dimensional assessment of websites based on their multi-source metrics. We then built a malicious website identification model using random forest, based on that assessment. [Results] PCA effectively extracted five assessment dimensions: authority, references, website traffic, ranking, and links. The identification model was also accurate and efficient. [Limitations] Most of the samples in this study were foreign websites, so the extracted dimensions may differ from those for websites in China. Additionally, we did not study the effect of the ratio of malicious to normal websites. [Conclusions] The proposed model can effectively extract dimensions for website assessment and then identify malicious websites.
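
The two-stage approach described in the abstract (PCA for dimension extraction, followed by a random forest classifier) can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the abstract does not state which software was used, so the use of scikit-learn is an assumption, and the data are synthetic stand-ins (500 hypothetical websites with 12 hypothetical metrics) rather than the real multi-source metrics. Only the choice of five components mirrors the five dimensions reported in the results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for multi-source website metrics (hypothetical data:
# 500 sites, 12 metrics; label 1 = malicious, 0 = normal).
X, y = make_classification(
    n_samples=500, n_features=12, n_informative=6, n_redundant=4,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Standardize the metrics, project them onto five principal components
# (the number of dimensions reported in the abstract), then classify.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    RandomForestClassifier(n_estimators=200, random_state=0),
)
model.fit(X_train, y_train)

# Share of variance captured by each extracted component.
explained = model.named_steps["pca"].explained_variance_ratio_

# Component scores for the held-out sites (scaler + PCA steps only).
scores_test = model[:-1].transform(X_test)

# Accuracy of the full pipeline on the held-out sites.
accuracy = model.score(X_test, y_test)

print("explained variance ratio:", np.round(explained, 3))
print("test accuracy:", round(accuracy, 3))
```

With real data, the PCA loadings (`model.named_steps["pca"].components_`) would be the place to look when interpreting and naming each extracted dimension; on synthetic data they carry no such meaning.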

Key words: Malicious Websites; Assessment and Identification; Principal Component Analysis; Random Forest
Received: 24 November 2017      Published: 11 May 2018

Cite this article:

Yuan Chen, Chaoqun Wang, Zhongyi Hu, Jiang Wu. Identifying Malicious Websites with PCA and Random Forest Methods. Data Analysis and Knowledge Discovery, 2018, 2(4): 71-80.
