Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (4): 71-80    DOI: 10.11925/infotech.2096-3467.2017.1188
Original Article
Identifying Malicious Websites with PCA and Random Forest Methods
Chen Yuan, Wang Chaoqun, Hu Zhongyi, Wu Jiang
School of Information Management, Wuhan University, Wuhan 430072, China
The Center for Electronic Commerce Research and Development, Wuhan University, Wuhan 430072, China
Abstract  

[Objective] This study assesses and identifies malicious websites using evaluation metrics drawn from multiple sources. [Methods] We applied principal component analysis (PCA) to multi-source website metrics to assess malicious websites along multiple dimensions. We then built a random forest model on the resulting assessment to identify malicious sites. [Results] PCA extracted five assessment dimensions: authority, references, website traffic, ranking, and links. The identification model was accurate and efficient. [Limitations] Most samples in this study were websites from outside China, so the extracted dimensions may differ from those of Chinese websites. We also did not study the ratio of malicious to normal websites. [Conclusions] The proposed model effectively extracts dimensions for website assessment and then identifies the malicious websites.
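
The two-stage approach described in the abstract can be sketched as a short pipeline. This is a minimal illustration using scikit-learn on synthetic data, not the authors' implementation: the sample size, the synthetic labels, the number of trees, and the train/test split are all assumptions, and scikit-learn's `PCA` does not rotate components, whereas the "RC" column labels in the loadings table further down suggest rotated components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for 16 website metrics driven by 5 latent factors.
# All sizes and labels here are hypothetical, not the paper's data.
rng = np.random.default_rng(0)
n_sites, n_metrics, n_factors = 400, 16, 5
latent = rng.normal(size=(n_sites, n_factors))
mixing = rng.normal(size=(n_factors, n_metrics))
X = latent @ mixing + 0.3 * rng.normal(size=(n_sites, n_metrics))
y = (latent[:, 0] > 0).astype(int)  # 1 = malicious, 0 = normal (assumed coding)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Standardize the metrics, reduce them to five components,
# then classify the component scores with a random forest.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    RandomForestClassifier(n_estimators=200, random_state=0),
)
model.fit(X_train, y_train)
pred = model.predict(X_test)
print("held-out accuracy:", (pred == y_test).mean())
```

Standardizing before PCA matters here because the metrics are on very different scales (ranks, link counts, load times); without it the largest-valued metric would dominate the components.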

Keywords: Malicious Websites; Assessment and Identification; Principal Component Analysis; Random Forest
Received: 24 November 2017      Published: 11 May 2018
ZTFLH:  G353  

Cite this article:

Chen Yuan, Wang Chaoqun, Hu Zhongyi, Wu Jiang. Identifying Malicious Websites with PCA and Random Forest Methods. Data Analysis and Knowledge Discovery, 2018, 2(4): 71-80.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.1188     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2018/V2/I4/71

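| Metric source | Metric name | Meaning |
|---|---|---|
| Moz | Moz’s Domain Authority | Moz's prediction of how a domain will rank in search engines |
| Moz | Moz’s total backlinks | All backlinks to the website |
| Moz | MozRank | Link popularity score |
| Majestic | Majestic’s Citation Flow | Measures citation sources through citation ranking |
| Majestic | Majestic’s Trust Flow | Measures trust sources by how closely a website is connected to trusted websites |
| Majestic | Majestic’s backlinks | Metric for the website's backlinks |
| Majestic | Majestic’s reference domains | Number of external links pointing to the website |
| Google | Google’s Page Rank | Website ranking that Google derives from hyperlink relationships between websites |
| Google | Google’s Page Speed | Google's metric for page loading speed |
| Alexa | Alexa’s rank | Website ranking determined by traffic |
| Alexa | Alexa’s 1 month reach | Average daily visits over the most recent month |
| Alexa | Alexa’s 3 month reach | Average daily visits over the most recent three months |
| Alexa | Alexa’s median load | Average page loading speed computed with Alexa's proprietary algorithm |
| Social networking sites | Facebook shares | Popularity on Facebook |
| Social networking sites | Twitter tweets | Popularity on Twitter |
| Social networking sites | Google plus shares | Popularity on Google Plus |
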
| Component | Eigenvalue | Proportion of variance | Cumulative proportion |
|---|---|---|---|
| 1 | 5.12214596 | 0.3201341227 | 0.3201341 |
| 2 | 2.39191511 | 0.1494946944 | 0.4696288 |
| 3 | 1.87757791 | 0.1173486195 | 0.5869774 |
| 4 | 1.16274401 | 0.0726715004 | 0.6596489 |
| 5 | 1.01827300 | 0.0636420626 | 0.7232910 |
| 6 | 0.92795486 | 0.0579971785 | 0.7812882 |
| 7 | 0.83805885 | 0.0523786779 | 0.8336669 |
| 8 | 0.57201780 | 0.0357511122 | 0.8694180 |
| 9 | 0.55509074 | 0.0346931711 | 0.9041111 |
| 10 | 0.51360908 | 0.0321005675 | 0.9362117 |
| 11 | 0.30381803 | 0.0189886270 | 0.9552003 |
| 12 | 0.24950388 | 0.0155939926 | 0.9707943 |
| 13 | 0.22731629 | 0.0142072679 | 0.9850016 |
| 14 | 0.16495318 | 0.0103095735 | 0.9953112 |
| 15 | 0.06485261 | 0.0040532884 | 0.9993645 |
| 16 | 0.01016870 | 0.0006355439 | 1.0000000 |

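The proportion and cumulative columns follow directly from the eigenvalues, and exactly five eigenvalues exceed 1, which matches the five dimensions reported in the abstract. The sketch below recomputes both. Using the eigenvalue-greater-than-one (Kaiser) rule as the retention criterion is an assumption; the page does not state which rule the authors applied. The function names are illustrative.

```python
# Eigenvalues copied from the table above (16 website metrics).
eigenvalues = [
    5.12214596, 2.39191511, 1.87757791, 1.16274401,
    1.01827300, 0.92795486, 0.83805885, 0.57201780,
    0.55509074, 0.51360908, 0.30381803, 0.24950388,
    0.22731629, 0.16495318, 0.06485261, 0.01016870,
]


def variance_table(values):
    """Return (proportion, cumulative proportion) pairs, one per component."""
    total = sum(values)
    rows, running = [], 0.0
    for value in values:
        proportion = value / total
        running += proportion
        rows.append((proportion, running))
    return rows


def kaiser_retained(values, threshold=1.0):
    """Count components whose eigenvalue exceeds the threshold (assumed rule)."""
    return sum(1 for value in values if value > threshold)


rows = variance_table(eigenvalues)
n_retained = kaiser_retained(eigenvalues)
# Number of retained components and the variance they explain together.
print(n_retained, round(rows[n_retained - 1][1], 4))  # prints: 5 0.7233
```

The five retained components therefore account for about 72% of the total variance in the 16 metrics.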
| Variable | RC1 | RC2 | RC3 | RC4 | RC5 | h2 | u2 |
|---|---|---|---|---|---|---|---|
| MozDomainAuthority | 0.88 | 0.09 | 0.09 | -0.01 | 0.03 | 0.80 | 0.2034 |
| MozTotalBacklinks | 0.08 | 0.13 | -0.02 | -0.04 | 0.88 | 0.80 | 0.1994 |
| MozRank | 0.86 | 0.06 | 0.03 | 0.08 | 0.04 | 0.74 | 0.2556 |
| GooglePageRank | 0.91 | 0.07 | 0.08 | -0.01 | 0.00 | 0.83 | 0.1695 |
| FacebookShares | -0.02 | 0.79 | 0.02 | 0.04 | -0.10 | 0.64 | 0.3572 |
| TwitterTweets | 0.08 | 0.78 | -0.01 | 0.00 | -0.11 | 0.62 | 0.3798 |
| GooglePlusShares | 0.32 | 0.13 | -0.08 | -0.24 | -0.30 | 0.27 | 0.7308 |
| AlexaMedianLoad | 0.53 | 0.04 | 0.11 | 0.53 | -0.03 | 0.57 | 0.4283 |
| AlexaRanks | 0.00 | 0.00 | -0.05 | 0.90 | 0.00 | 0.81 | 0.1931 |
| Alexa1MthReach | 0.09 | -0.01 | 0.99 | 0.00 | 0.00 | 0.99 | 0.0097 |
| Alexa3MthReach | 0.08 | 0.00 | 0.99 | 0.00 | 0.00 | 0.99 | 0.0110 |
| GooglePageSpeed | 0.42 | -0.03 | -0.03 | 0.18 | -0.03 | 0.21 | 0.7901 |
| MajesticCitationFlow | 0.93 | 0.16 | 0.05 | -0.03 | 0.08 | 0.90 | 0.1026 |
| MajesticTrustFlow | 0.92 | 0.15 | 0.08 | -0.07 | 0.07 | 0.88 | 0.1170 |
| MajesticBacklinks | 0.17 | 0.73 | -0.02 | -0.04 | 0.40 | 0.73 | 0.2710 |
| MajesticReferenceDomains | 0.21 | 0.77 | -0.02 | -0.05 | 0.40 | 0.79 | 0.2088 |

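In a loadings table of this kind, h2 is conventionally the communality (the sum of a variable's squared loadings) and u2 is the uniqueness, 1 - h2. The sketch below checks that reading against a few rows copied from the table and reports the component on which each variable loads most heavily. The interpretation of h2 and u2 is an assumption based on common convention, the helper names are illustrative, and small differences from the table are expected because the loadings are rounded to two decimals.

```python
# A few rows copied from the loadings table above (RC1..RC5).
loadings = {
    "MozDomainAuthority": [0.88, 0.09, 0.09, -0.01, 0.03],
    "FacebookShares": [-0.02, 0.79, 0.02, 0.04, -0.10],
    "Alexa1MthReach": [0.09, -0.01, 0.99, 0.00, 0.00],
    "AlexaRanks": [0.00, 0.00, -0.05, 0.90, 0.00],
    "MozTotalBacklinks": [0.08, 0.13, -0.02, -0.04, 0.88],
}


def communality(row):
    """Sum of squared loadings: variance explained by the retained components."""
    return sum(value * value for value in row)


def dominant_component(row):
    """Label of the component with the largest absolute loading."""
    index = max(range(len(row)), key=lambda i: abs(row[i]))
    return f"RC{index + 1}"


for name, row in loadings.items():
    h2 = communality(row)
    # Print the variable, its dominant component, communality, and uniqueness.
    print(name, dominant_component(row), round(h2, 2), round(1 - h2, 2))
```

Variables with low communality in the table, such as GooglePlusShares (0.27) and GooglePageSpeed (0.21), are poorly represented by the five components.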
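| | Predicted normal website | Predicted phishing website |
|---|---|---|
| Actual normal website | TN | FP |
| Actual phishing website | FN | TP |
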
| Accuracy | Precision | Recall | F-measure |
|---|---|---|---|
| 0.91 | 0.90 | 0.92 | 0.91 |

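The four scores are standard functions of the confusion-matrix cells TN, FP, FN, and TP. The sketch below computes them, treating the F-measure as the balanced F1 score (an assumption; the page does not give the formula). The counts in the example are hypothetical, chosen only so that the rounded results match the table; they are not the paper's data.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    Assumes at least one predicted positive and one actual positive,
    so no denominator is zero.
    """
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)   # share of flagged sites that are phishing
    recall = tp / (tp + fn)      # share of phishing sites that are flagged
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for illustration only (not taken from the paper).
metrics = classification_metrics(tn=90, fp=10, fn=8, tp=92)
for name, value in metrics.items():
    print(name, round(value, 2))
```

As a consistency check on the reported values, a precision of 0.90 and a recall of 0.92 give 2 × 0.90 × 0.92 / (0.90 + 0.92) ≈ 0.91, which agrees with the F-measure in the table.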
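| Algorithm | F-measure |
|---|---|
| Hybrid model | 0.91 |
| AdaBoost | 0.94 |
| Bagging | 0.92 |
| Naive Bayes | 0.80 |
| Random forest | 0.94 |
| Decision tree | 0.89 |
| K-nearest neighbors | 0.91 |
| Neural network | 0.88 |
| SVM | 0.91 |
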
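| Algorithm comparison | p-value |
|---|---|
| Hybrid model vs. AdaBoost | 0.00** |
| Hybrid model vs. Bagging | 2.56E-04** |
| Hybrid model vs. Naive Bayes | 1.67E-04** |
| Hybrid model vs. Random forest | 0.00** |
| Hybrid model vs. Decision tree | 0.55 |
| Hybrid model vs. K-nearest neighbors | 0.74 |
| Hybrid model vs. Neural network | 0.24 |
| Hybrid model vs. SVM | 0.13 |
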