Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (6): 47-55    DOI: 10.11925/infotech.2096-3467.2017.06.05
Original Article
Identifying Phishing Websites with Multiple Online Data Sources
Hu Zhongyi, Wang Chaoqun, Wu Jiang
School of Information Management, Wuhan University, Wuhan 430072, China
The Center for Electronic Commerce Research and Development, Wuhan University, Wuhan 430072, China
Abstract  

[Objective] This study aims to identify phishing websites more effectively by using online evaluation data together with abnormal URL features. [Methods] First, we applied eight machine learning techniques to compare how well online evaluation data and abnormal URL features identify phishing websites. Then, we proposed a new method to improve identification accuracy. [Results] Online evaluation data performed better than abnormal URL features, and combining the two data sets could improve identification performance. [Limitations] We did not consider the imbalance between the numbers of phishing and legitimate websites. [Conclusions] Online evaluation data and abnormal URL features can help identify phishing websites effectively, which suggests a direction for future studies.

Key words: Data Mining; Phishing Websites Identification; Machine Learning
Received: 10 April 2017      Published: 25 August 2017
ZTFLH:  G353  

Cite this article:

Hu Zhongyi, Wang Chaoqun, Wu Jiang. Identifying Phishing Websites with Multiple Online Data Sources. Data Analysis and Knowledge Discovery, 2017, 1(6): 47-55.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.06.05     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2017/V1/I6/47

                     Predicted legitimate   Predicted phishing
Actual legitimate    TN                     FP
Actual phishing      FN                     TP
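
The four measures reported in the result tables (accuracy, precision, recall, F-measure) follow from the confusion matrix above, with phishing treated as the positive class. The sketch below uses the standard textbook definitions; the class name `ConfusionMatrix` and the counts in the example are hypothetical and are not taken from the paper, which may also average its scores over several runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts for a binary classifier; phishing is the positive class."""
    tn: int  # legitimate site predicted as legitimate
    fp: int  # legitimate site predicted as phishing
    fn: int  # phishing site predicted as legitimate
    tp: int  # phishing site predicted as phishing

    def accuracy(self) -> float:
        # Share of all sites that were classified correctly.
        total = self.tn + self.fp + self.fn + self.tp
        return (self.tp + self.tn) / total if total else 0.0

    def precision(self) -> float:
        # Share of sites flagged as phishing that really are phishing.
        predicted_positive = self.tp + self.fp
        return self.tp / predicted_positive if predicted_positive else 0.0

    def recall(self) -> float:
        # Share of real phishing sites that were flagged.
        actual_positive = self.tp + self.fn
        return self.tp / actual_positive if actual_positive else 0.0

    def f_measure(self) -> float:
        # Harmonic mean of precision and recall.
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if (p + r) else 0.0


# Made-up counts, for illustration only (not results from the paper).
cm = ConfusionMatrix(tn=90, fp=10, fn=20, tp=80)
print(cm.accuracy(), cm.precision(), cm.recall(), cm.f_measure())
```

A classifier with high precision but low recall flags few legitimate sites by mistake yet misses many phishing sites, which is why the F-measure is reported alongside accuracy.
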
Method                      Accuracy  Precision  Recall  F-measure
Decision Tree               0.5935    0.9099     0.2150  0.3433
SVM                         0.6340    0.7744     0.3780  0.5074
K-Nearest Neighbors         0.6205    0.6411     0.5610  0.5954
Naive Bayes                 0.5990    0.9720     0.2040  0.3362
Artificial Neural Network   0.6420    0.7535     0.4290  0.5457
AdaBoost                    0.6435    0.7500     0.4400  0.5534
Bagging                     0.6445    0.7587     0.4260  0.5443
Random Forest               0.6390    0.7828     0.3850  0.5155

Method                      Accuracy  Precision  Recall  F-measure
Decision Tree               0.8810    0.8576     0.9160  0.8845
SVM                         0.9145    0.9026     0.9310  0.9159
K-Nearest Neighbors         0.9115    0.9030     0.9240  0.9126
Naive Bayes                 0.7455    0.6659     0.9890  0.7956
Artificial Neural Network   0.8695    0.9226     0.8460  0.8818
AdaBoost                    0.9415    0.9335     0.9500  0.9412
Bagging                     0.9230    0.9174     0.9310  0.9234
Random Forest               0.9415    0.9355     0.9500  0.9421

Method                      Accuracy  Precision  Recall  F-measure
Decision Tree               0.8810    0.8576     0.9160  0.8845
SVM                         0.9119    0.9280     0.9194  0.9185
K-Nearest Neighbors         0.9200    0.9133     0.9300  0.9208
Naive Bayes                 0.7690    0.6881     0.9880  0.8108
Artificial Neural Network   0.8945    0.8879     0.8710  0.8776
AdaBoost                    0.9415    0.9383     0.9430  0.9403
Bagging                     0.9230    0.9174     0.9310  0.9234
Random Forest               0.9435    0.9363     0.9530  0.9442