Identifying Phishing Websites Based on URL Multi-Granularity Feature Fusion
Hu Zhongyi, Zhang Shuoguo, Wu Jiang
School of Information Management, Wuhan University, Wuhan 430072, China
The Center for Electronic Commerce Research and Development, Wuhan University, Wuhan 430072, China

Abstract [Objective] This study proposes a model based on URL multi-granularity feature fusion to identify phishing websites more effectively. [Methods] First, we extracted character-level and word-level features of URLs with one-hot encoding and BERT. Then, we constructed the identification model by fusing the deep features of both granularities. [Results] The accuracy, recall, F-value, and AUC of the proposed model reached 0.96, 0.98, 0.97, and 0.97, respectively. The model outperformed models based on single-granularity feature representations, benchmark classifiers, and other popular models. [Limitations] Further research is needed to incorporate webpage content into the model. [Conclusions] The proposed model represents URL features more comprehensively and identifies phishing websites effectively.
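
The two granularities named in the abstract can be illustrated with a minimal sketch. It is not the authors' model: the character alphabet, `MAX_LEN`, `WORD_DIM`, and every function name are assumptions made for illustration, the word-level vector is a hashed bag-of-words stand-in rather than BERT, and the fusion is plain concatenation of pooled vectors instead of learned deep features.

```python
import re
import zlib

# Assumed character vocabulary; the abstract does not state the alphabet used.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=%"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
MAX_LEN = 32    # hypothetical fixed URL length (longer URLs are truncated)
WORD_DIM = 16   # hypothetical size of the stand-in word-level vector


def char_one_hot(url):
    """Return a MAX_LEN x len(ALPHABET) one-hot matrix.

    Characters outside ALPHABET and padding positions become all-zero rows.
    """
    rows = []
    for ch in url.lower()[:MAX_LEN]:
        row = [0] * len(ALPHABET)
        idx = CHAR_INDEX.get(ch)
        if idx is not None:
            row[idx] = 1
        rows.append(row)
    while len(rows) < MAX_LEN:
        rows.append([0] * len(ALPHABET))
    return rows


def url_words(url):
    """Split a URL into word-level tokens on non-alphanumeric delimiters."""
    return [w for w in re.split(r"[^A-Za-z0-9]+", url.lower()) if w]


def word_vector(words):
    """Hashed bag-of-words stand-in for a learned word embedding (NOT BERT)."""
    vec = [0.0] * WORD_DIM
    for w in words:
        vec[zlib.crc32(w.encode("utf-8")) % WORD_DIM] += 1.0
    return vec


def fuse(url):
    """Concatenate pooled character-level features with the word-level vector."""
    matrix = char_one_hot(url)
    # Sum over positions, giving one count per character in ALPHABET.
    char_vec = [float(sum(col)) for col in zip(*matrix)]
    return char_vec + word_vector(url_words(url))


features = fuse("http://paypal.example-login.com/verify")
print(len(features))
```

In a fuller pipeline, each branch would presumably feed its own deep network before fusion, and the fused vector would go to a classifier; the sketch only shows how one URL yields two views at different granularities.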

Received: 23 February 2022
Published: 13 January 2023

Fund: Major Project of Philosophy and Social Science Research of the Ministry of Education (20JZD024); China Postdoctoral Science Foundation (2019T120690)
Corresponding Author: Hu Zhongyi, E-mail: zhongyi.hu@whu.edu.cn