Identifying Phishing Websites Based on URL Multi-Granularity Feature Fusion
Hu Zhongyi, Zhang Shuoguo, Wu Jiang
School of Information Management, Wuhan University, Wuhan 430072, China
The Center for Electronic Commerce Research and Development, Wuhan University, Wuhan 430072, China
[Objective] This study proposes a model based on multi-granularity URL feature fusion to identify phishing websites more effectively. [Methods] First, we extracted character-level and word-level URL features using one-hot encoding and BERT, respectively. Then, we built the identification model by fusing the deep features of both granularities. [Results] The proposed model achieved an accuracy of 0.96, a recall of 0.98, an F-value of 0.97, and an AUC of 0.97. It outperformed models based on single-granularity feature representations, benchmark classifiers, and other popular models. [Limitations] Further research is needed to incorporate webpage content into the model. [Conclusions] The proposed model represents URL features more comprehensively and identifies phishing websites effectively.
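To make the two granularities in the Methods concrete, the following is a minimal Python sketch of how a single URL might be prepared at the character level (one-hot encoding) and at the word level (tokens that a BERT encoder would then embed). The alphabet, the length cap `MAX_LEN`, the splitting rule, the function names, and the example URL are all illustrative assumptions, not the settings used in the study. The BERT encoding and the deep feature fusion steps are not reproduced here.

```python
import re

# Hypothetical settings: the study's actual alphabet and length cap are not given here.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=%"
MAX_LEN = 32


def char_one_hot(url, alphabet=ALPHABET, max_len=MAX_LEN):
    """Return a max_len x (len(alphabet) + 1) one-hot matrix for a URL.

    The extra last column marks characters outside the alphabet.
    URLs longer than max_len are truncated; shorter ones are padded
    with all-zero rows.
    """
    index = {ch: i for i, ch in enumerate(alphabet)}
    width = len(alphabet) + 1
    rows = []
    for ch in url.lower()[:max_len]:
        row = [0] * width
        row[index.get(ch, width - 1)] = 1
        rows.append(row)
    while len(rows) < max_len:
        rows.append([0] * width)
    return rows


def url_words(url):
    """Split a URL into word-level tokens on non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", url.lower()) if tok]


# Made-up example URL for illustration only.
url = "http://paypal-login.example.com/verify?id=42"
char_matrix = char_one_hot(url)   # character-granularity input
words = url_words(url)            # word-granularity input for a BERT-style encoder
```

In a model of the kind described, the character matrix and the word tokens would each pass through their own deep feature extractor before the two representations are fused for classification; how that fusion is performed is specified in the full paper, not in this abstract.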