Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (11): 103-110    DOI: 10.11925/infotech.2096-3467.2022.0141
Identifying Phishing Websites Based on URL Multi-Granularity Feature Fusion
Hu Zhongyi, Zhang Shuoguo, Wu Jiang
School of Information Management, Wuhan University, Wuhan 430072, China
The Center for Electronic Commerce Research and Development, Wuhan University, Wuhan 430072, China
Abstract  

[Objective] This study proposes a model based on multi-granularity URL feature fusion to identify phishing websites more effectively. [Methods] First, we extracted character-level and word-level features of URLs using one-hot encoding and BERT, respectively. Then, we built the identification model by fusing the deep features of both granularities. [Results] The accuracy, recall, F1 score, and AUC of the proposed model reached 0.96, 0.98, 0.97, and 0.97, respectively. It performed better than models based on single-granularity feature representations, benchmark classifiers, and other popular models. [Limitations] More research is needed to incorporate webpage content into the model. [Conclusions] The proposed model represents URL features more comprehensively and identifies phishing websites effectively.

Key words: Phishing Websites Identification; Feature Fusion; BERT; Word2Vec; CNN; LSTM
Received: 23 February 2022      Published: 13 January 2023
Chinese Library Classification (ZTFLH): G353
Fund: Major Project of Philosophy and Social Science Research of the Ministry of Education (20JZD024); China Postdoctoral Science Foundation (2019T120690)
Corresponding Authors: Hu Zhongyi     E-mail: zhongyi.hu@whu.edu.cn

Cite this article:

Hu Zhongyi,Zhang Shuoguo,Wu Jiang. Identifying Phishing Websites Based on URL Multi-Granularity Feature Fusion. Data Analysis and Knowledge Discovery, 2022, 6(11): 103-110.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.0141     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I11/103

Figure: Phishing Websites Identification Model Structure
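The abstract describes two views of each URL: a character-level view encoded with one-hot vectors and a word-level view fed to BERT. The following is a minimal sketch of how the two inputs might be prepared, not the authors' implementation. The alphabet, the delimiter rule, and the function names `char_one_hot` and `word_tokens` are assumptions made for illustration; only the length limits (128 characters, 50 words) come from the parameter table on this page.

```python
import re
import string

# Hypothetical character vocabulary (an assumption, not from the paper):
# lowercase letters, digits, and ASCII punctuation.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def char_one_hot(url, max_len=128):
    """Encode a URL as max_len one-hot rows.

    Characters outside ALPHABET and padding positions stay all-zero.
    """
    rows = []
    for ch in url.lower()[:max_len]:
        row = [0] * len(ALPHABET)
        idx = CHAR_INDEX.get(ch)
        if idx is not None:
            row[idx] = 1
        rows.append(row)
    # Pad short URLs with all-zero rows so every URL has the same shape.
    rows.extend([0] * len(ALPHABET) for _ in range(max_len - len(rows)))
    return rows


def word_tokens(url, max_len=50):
    """Split a URL into word-level tokens on non-alphanumeric delimiters."""
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", url.lower()) if t]
    return tokens[:max_len]


url = "http://paypal.com.secure-login.example.net/verify?id=42"
matrix = char_one_hot(url)
tokens = word_tokens(url)
print(len(matrix), len(matrix[0]))
print(tokens)
```

In a full pipeline the token list would be passed to a BERT tokenizer and encoder rather than used directly; that step is omitted here because the page does not say which BERT checkpoint was used.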
Table: Division of Training Set and Test Set
Label   Class                 Training set   Test set
0       Phishing websites     13,434         1,493
1       Legitimate websites   13,248         1,472
Total                         26,682         2,965
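As a quick check on the split, the counts in the table above can be turned into proportions. The short sketch below uses only the numbers from the table; the variable names are illustrative.

```python
# Counts taken from the training/test split table.
train = {"phishing": 13434, "legitimate": 13248}
test = {"phishing": 1493, "legitimate": 1472}

train_total = sum(train.values())
test_total = sum(test.values())

# Share of all samples held out for testing.
test_fraction = test_total / (train_total + test_total)

# Class balance within each partition.
phishing_share_train = train["phishing"] / train_total
phishing_share_test = test["phishing"] / test_total

print(f"test fraction: {test_fraction:.3f}")
print(f"phishing share, train: {phishing_share_train:.4f}")
print(f"phishing share, test:  {phishing_share_test:.4f}")
```

The numbers correspond to roughly a 90/10 split, with phishing and legitimate URLs close to balanced in both partitions.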
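Table: Parameter Settings of the URL Feature Extraction Models
Parameter   Meaning                                                     Word-level model   Character-level model
Length      Maximum length of the word (or character) sequence          50                 128
Filter      Number of filters in the CNN layer                          3                  3
Kernel      Kernel size of the CNN layer                                6                  32
Stride      Stride of the CNN layer                                     1                  1
Dropout     Dropout probability from input layer to hidden layer, LSTM  0.3                0.3
Recurrent   Dropout probability of the hidden layer, LSTM               0.3                0.3
Outsize     Output dimension                                            60                 95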
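To make the parameter table concrete, the sketch below stores both configurations and computes how many time steps the CNN layer would pass on to the LSTM. It assumes a 1-D convolution without padding, which the page does not state; with a different padding mode the sequence lengths would differ. The class name `ExtractorConfig` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractorConfig:
    """Hyperparameters of one feature extraction branch (values from the table)."""
    length: int               # maximum sequence length
    filters: int              # number of CNN filters
    kernel: int               # CNN kernel size
    stride: int               # CNN stride
    dropout: float            # LSTM input-to-hidden dropout
    recurrent_dropout: float  # LSTM hidden-state dropout
    out_size: int             # output feature dimension

    def conv_output_length(self):
        # Assumes no padding: floor((L - K) / S) + 1
        return (self.length - self.kernel) // self.stride + 1


WORD = ExtractorConfig(50, 3, 6, 1, 0.3, 0.3, 60)
CHAR = ExtractorConfig(128, 3, 32, 1, 0.3, 0.3, 95)

for name, cfg in (("word", WORD), ("char", CHAR)):
    print(name, cfg.conv_output_length(), cfg.out_size)
```

Under the no-padding assumption, the word-level branch yields 45 steps and the character-level branch yields 97 steps before the LSTM.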
Table: Model Performance
Accuracy   Recall   F1     AUC
0.96       0.98     0.97   0.97
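For readers who want to relate these metrics to one another, the sketch below computes accuracy, precision, recall, and F1 from a confusion matrix. The counts in the example are hypothetical and are not the paper's confusion matrix, which this page does not report. The helper `implied_precision` inverts the F1 formula; applied to the rounded values above it gives only an approximate precision.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for P."""
    return f1 * recall / (2 * recall - f1)


# Hypothetical counts, chosen only to illustrate the formulas.
metrics = classification_metrics(tp=98, fp=4, fn=2, tn=96)
print({k: round(v, 4) for k, v in metrics.items()})

# Approximate precision implied by the rounded F1 and recall in the table.
print(round(implied_precision(0.97, 0.98), 3))
```

AUC is not derived here because it depends on the ranking of predicted scores, not on a single confusion matrix.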
Table: Performance Based on Different Feature Extraction Models
Granularity               Encoding            Abbr.   Accuracy   Recall   F1     AUC
Character-level           One-hot encoding    Co      0.81       0.86     0.83   0.83
Word-level                One-hot encoding    Wo      0.83       0.82     0.83   0.81
Word-level                Word2vec            Wv      0.76       0.76     0.76   0.76
Word-level                BERT                Wb      0.92       0.95     0.93   0.89
Character + word fusion   One-hot encoding    Co&Wo   0.91       0.93     0.93   0.93
Character + word fusion   Word2vec            Co&Wv   0.79       0.78     0.76   0.78
Character + word fusion   BERT (this paper)   Co&Wb   0.96       0.98     0.97   0.97
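The fused rows combine the character-level and word-level feature vectors before classification. The page does not say which fusion operator is used, so the sketch below assumes simple concatenation followed by a single sigmoid unit as a stand-in for the fully connected classifier. The feature values and weights are random placeholders, and the names `fuse` and `dense_sigmoid` are hypothetical.

```python
import math
import random

CHAR_DIM = 95  # output dimension of the character-level model (parameter table)
WORD_DIM = 60  # output dimension of the word-level model (parameter table)


def fuse(char_features, word_features):
    """Concatenate the two feature vectors (assumed fusion operator)."""
    return list(char_features) + list(word_features)


def dense_sigmoid(features, weights, bias):
    """One fully connected unit with a sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)
# Placeholder deep features; in practice these come from the CNN-LSTM branches.
char_features = [rng.random() for _ in range(CHAR_DIM)]
word_features = [rng.random() for _ in range(WORD_DIM)]

fused = fuse(char_features, word_features)
weights = [rng.uniform(-0.1, 0.1) for _ in fused]
score = dense_sigmoid(fused, weights, bias=0.0)

print(len(fused))
print(0.0 < score < 1.0)
```

Concatenation keeps both granularities intact and lets the classifier learn how to weight them, which is one common reading of "feature fusion"; other operators such as weighted sums or attention are equally compatible with the description on this page.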
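Table: Performance of Different Identification Models
Multi-granularity features   Classifier                              Accuracy   Recall   F1     AUC
Co&Wb                        Fully connected network (this paper)    0.96       0.98     0.97   0.97
Co&Wb                        Support vector machine                  0.91       0.91     0.90   0.90
Co&Wb                        Decision tree                           0.94       0.91     0.92   0.91
Co&Wb                        K-nearest neighbors                     0.89       0.89     0.88   0.88
Co&Wb                        Random forest                           0.95       0.96     0.97   0.96
Co&Wv                        Fully connected network                 0.79       0.78     0.76   0.78
Co&Wv                        Support vector machine                  0.82       0.81     0.81   0.80
Co&Wv                        Decision tree                           0.79       0.78     0.78   0.78
Co&Wv                        K-nearest neighbors                     0.77       0.77     0.77   0.79
Co&Wv                        Random forest                           0.81       0.82     0.80   0.79
Co&Wo                        Fully connected network                 0.91       0.93     0.93   0.93
Co&Wo                        Support vector machine                  0.88       0.89     0.89   0.89
Co&Wo                        Decision tree                           0.90       0.90     0.90   0.90
Co&Wo                        K-nearest neighbors                     0.86       0.85     0.86   0.85
Co&Wo                        Random forest                           0.90       0.91     0.91   0.90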
Table: Performance Comparison with Existing Models
Model                 Accuracy   Recall   F1     AUC
This paper            0.96       0.98     0.97   0.97
Al-Alyan et al. [9]   0.91       0.86     0.88   0.90
Ren et al. [16]       0.83       0.87     0.85   0.80
Yang et al. [14]      0.87       0.90     0.89   0.90
Huang et al. [18]     0.96       0.97     0.97   0.96
[1] Sheng S, Wardman B, Warner G, et al. An Empirical Analysis of Phishing Blacklists[C]// Proceedings of the 6th Conference on Email and Anti-Spam. 2009: 112-118.
[2] Purkait S. Examining the Effectiveness of Phishing Filters Against DNS Based Phishing Attacks[J]. Information & Computer Security, 2015, 23(3): 333-346.
[3] Blum A, Wardman B, Solorio T, et al. Lexical Feature Based Phishing URL Detection Using Online Learning[C]// Proceedings of the 3rd ACM Workshop on Artificial Intelligence and Security. 2010: 54-60.
[4] Huang Huajun, Qian Liang, Wang Yaojun. Detection of Phishing URL Based on Abnormal Feature[J]. Netinfo Security, 2012(1): 23-25, 67. (in Chinese)
[5] Hu Zhongyi, Wang Chaoqun, Wu Jiang. Identifying Phishing Websites with Multiple Online Data Sources[J]. Data Analysis and Knowledge Discovery, 2017, 1(6): 47-55. (in Chinese)
[6] Chen Yuan, Wang Chaoqun, Hu Zhongyi, et al. Identifying Malicious Websites with PCA and Random Forest Methods[J]. Data Analysis and Knowledge Discovery, 2018, 2(4): 71-80. (in Chinese)
[7] Bengio Y, Ducharme R, Vincent P, et al. A Neural Probabilistic Language Model[J]. The Journal of Machine Learning Research, 2003, 3: 1137-1155.
[8] Xiao X, Zhang D Y, Hu G W, et al. CNN-MHSA: A Convolutional Neural Network and Multi-Head Self-Attention Combined Approach for Detecting Phishing Websites[J]. Neural Networks, 2020, 125: 303-312.
pii: S0893-6080(20)30058-7 pmid: 32172140
[9] Al-Alyan A, Al-Ahmadi S. Robust URL Phishing Detection Based on Deep Learning[J]. KSII Transactions on Internet and Information Systems, 2020, 14(7): 2752-2768.
[10] Saxe J, Berlin K. EXpose: A Character-Level Convolutional Neural Network with Embeddings for Detecting Malicious URLs, File Paths and Registry Keys[OL]. arXiv Preprint, arXiv:1702.08568.
[11] Ozcan A, Catal C, Donmez E, et al. A Hybrid DNN-LSTM Model for Detecting Phishing URLs[J]. Neural Computing and Applications, 2021.
doi: 10.1007/s00521-021-06401-z
[12] Bahnsen A C, Bohorquez E C, Villegas S, et al. Classifying Phishing URLs Using Recurrent Neural Networks[C]// Proceedings of 2017 APWG Symposium on Electronic Crime Research (eCrime). 2017: 1-8.
[13] Vinayakumar R, Soman K P, Poornachandran P. Evaluating Deep Learning Approaches to Characterize and Classify Malicious URL’s[J]. Journal of Intelligent & Fuzzy Systems, 2018, 34(3): 1333-1343.
[14] Yang P, Zhao G Z, Zeng P. Phishing Website Detection Based on Multidimensional Features Driven by Deep Learning[J]. IEEE Access, 2019, 7: 15196-15209.
doi: 10.1109/ACCESS.2019.2892066
[15] Mikolov T, Chen K, Corrado G, et al. Efficient Estimation of Word Representations in Vector Space[OL]. arXiv Preprint, arXiv:1301.3781.
[16] Ren F L, Jiang Z W, Liu J. A Bi-Directional LSTM Model with Attention for Malicious URL Detection[C]// Proceedings of 2019 IEEE 4th Advanced Information Technology, Electronic and Automation Control Conference. 2019: 300-305.
[17] Wang W P, Zhang F, Luo X, et al. PDRCNN: Precise Phishing Detection with Recurrent Convolutional Neural Networks[J]. Security and Communication Networks, 2019, 2019: e2595794.
[18] Huang Y J, Yang Q P, Qin J H, et al. Phishing URL Detection via CNN and Attention-Based Hierarchical RNN[C]// Proceedings of 2019 18th IEEE International Conference on Trust, Security and Privacy in Computing and Communications/13th IEEE International Conference on Big Data Science and Engineering. 2019: 112-119.
[19] Feng J, Zou L Y, Ye O, et al. Web2Vec: Phishing Webpage Detection Method Based on Multidimensional Features Driven by Deep Learning[J]. IEEE Access, 2020, 8: 221214-221224.
doi: 10.1109/ACCESS.2020.3043188
[20] Le H, Pham Q, Sahoo D, et al. URLNet: Learning a URL Representation with Deep Learning for Malicious URL Detection[OL]. arXiv Preprint, arXiv:1802.03162.
[21] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[OL]. arXiv Preprint, arXiv:1810.04805.
  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938   E-mail:jishu@mail.las.ac.cn