Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (6): 47-55    DOI: 10.11925/infotech.2096-3467.2017.06.05
Original Article
Identifying Phishing Websites with Multiple Online Data Sources
Zhongyi Hu, Chaoqun Wang, Jiang Wu
School of Information Management, Wuhan University, Wuhan 430072, China
The Center for Electronic Commerce Research and Development, Wuhan University, Wuhan 430072, China

[Objective] This study aims to identify phishing websites more effectively using online evaluation data and abnormal URL features. [Methods] First, we used eight machine learning techniques to compare how well various online evaluation data and abnormal URL features identify phishing websites. Then, we proposed a new method to improve identification accuracy. [Results] The online evaluation data performed better than the abnormal URL features, and combining the two data sets could improve identification performance. [Limitations] We did not consider the imbalance between the numbers of phishing and legitimate websites. [Conclusions] Online evaluation data and abnormal URL features can help identify phishing websites effectively, which points to a direction for future studies.
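
To make the idea of combining the two data sources concrete, the following is a minimal sketch, not the authors' method. The abstract does not list the specific features, so the lexical URL indicators below (IP-address host, "@" symbol, dot count, length, hyphen in host) and the `eval_` prefix for online evaluation scores are illustrative assumptions, as are the function names `url_abnormal_features` and `combine_features`.

```python
import re
from urllib.parse import urlparse


def url_abnormal_features(url: str) -> dict:
    """Return a few illustrative lexical indicators for a URL.

    The feature set is hypothetical; the abstract does not specify
    which abnormal URL features the study used.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        # Host written as a dotted IPv4 address instead of a domain name
        "has_ip_host": int(bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))),
        # "@" anywhere in the URL can hide the real destination host
        "has_at_symbol": int("@" in url),
        "num_dots_in_host": host.count("."),
        "url_length": len(url),
        "has_hyphen_in_host": int("-" in host),
    }


def combine_features(url: str, evaluation: dict) -> dict:
    """Merge URL indicators with externally supplied evaluation scores.

    `evaluation` stands in for online evaluation data; its keys and
    values are supplied by the caller and are assumptions here.
    """
    features = url_abnormal_features(url)
    features.update({f"eval_{name}": value for name, value in evaluation.items()})
    return features
```

A combined record such as `combine_features("https://www.example.com/", {"trust": 0.9})` could then be passed to any classifier, which is one way the two data sets might be compared separately and jointly.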

Key words: Data Mining; Phishing Websites Identification; Machine Learning
Received: 10 April 2017      Published: 25 August 2017

Cite this article:

Zhongyi Hu, Chaoqun Wang, Jiang Wu. Identifying Phishing Websites with Multiple Online Data Sources. Data Analysis and Knowledge Discovery, 2017, 1(6): 47-55.

