Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (5): 116-122    DOI: 10.11925/infotech.2096-3467.2022.0609
A Novel Borderline Over-Sampling Method Based on KNN and Deep Gaussian Mixture Model for Imbalanced Data
Zhang Haibin (1,2), Xiao Han (1,3), Yi Cancan (1,3), Yuan Rui (1,3)
1. Key Laboratory of Metallurgical Equipment and Control Technology, Ministry of Education, Wuhan University of Science and Technology, Wuhan 430081, China
2. Hubei Key Laboratory of Mechanical Transmission and Manufacturing Engineering, Wuhan University of Science and Technology, Wuhan 430081, China
3. Precision Manufacturing Institute, Wuhan University of Science and Technology, Wuhan 430081, China
Abstract  

[Objective] This paper proposes a borderline over-sampling method based on the k-nearest neighbor (KNN) algorithm and the Deep Gaussian Mixture Model (DGMM) to reduce the classifier bias caused by imbalanced data. [Methods] First, we used the KNN algorithm to identify the borderline minority samples in the training set. Second, we constructed a DGMM for the minority samples. Third, we applied the DGMM in reverse, as a generator, to produce over-sampled data that conform to the distribution of the borderline minority samples. Finally, we used the three-sigma rule to remove noisy samples, and repeated the process until no outlier samples were generated. [Results] The proposed method improved the AUC and G-mean by up to 8.62% and 12.99%, respectively, and by 3.51% and 4.93% on average. [Limitations] The parameter optimization method for the DGMM needs further improvement. [Conclusions] The proposed method handles imbalanced data more effectively than the compared over-sampling methods.
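The steps in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the borderline rule is the "danger" criterion of Borderline-SMOTE, the generator is a single diagonal Gaussian used as a stand-in for the DGMM, and the "repeat until no outliers" step is read as rejection sampling. The function names `borderline_minority` and `oversample` and the toy data are hypothetical.

```python
import math
import random
import statistics


def borderline_minority(X, y, minority, k=5):
    """Return the minority samples that sit near the class boundary.

    Assumed rule (the Borderline-SMOTE "danger" set): a minority sample is
    borderline when k/2 <= m < k, where m counts the majority-class samples
    among its k nearest neighbours.
    """
    borderline = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority:
            continue
        # Rank every other sample by Euclidean distance to xi.
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )
        m = sum(1 for j in others[:k] if y[j] != minority)
        if k / 2 <= m < k:
            borderline.append(xi)
    return borderline


def oversample(borderline, n_new, rng):
    """Draw n_new synthetic samples, rejecting three-sigma outliers.

    Stand-in generator: one diagonal Gaussian fitted to the borderline
    samples. The paper uses a Deep Gaussian Mixture Model here instead.
    """
    if not borderline:
        raise ValueError("no borderline samples to model")
    columns = list(zip(*borderline))
    mu = [statistics.fmean(c) for c in columns]
    sigma = [statistics.pstdev(c) for c in columns]
    kept = []
    while len(kept) < n_new:
        cand = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
        # Three-sigma rule: keep a candidate only if every feature lies
        # within 3 standard deviations of the fitted mean.
        if all(abs(c - m) <= 3 * s for c, m, s in zip(cand, mu, sigma)):
            kept.append(cand)
    return kept


# Toy data: eight majority points (label 0), four minority points (label 1).
X = [[-0.4, 0.0], [-0.3, 0.0], [-0.2, 0.0], [-0.1, 0.0],
     [0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.3, 0.0],
     [0.45, 0.0], [0.55, 0.0], [2.0, 0.0], [2.1, 0.0]]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

borderline = borderline_minority(X, y, minority=1, k=3)
synthetic = oversample(borderline, n_new=10, rng=random.Random(0))
print(borderline)      # [[0.45, 0.0], [0.55, 0.0]]
print(len(synthetic))  # 10
```

Only the two minority points next to the majority cluster are selected; the two isolated minority points at 2.0 and 2.1 have no majority neighbours among their three nearest and are left out. A faithful implementation would replace the single Gaussian with a fitted DGMM.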

Key words: Imbalanced Data; Over-Sampling; Deep Gaussian Mixture Model
Received: 14 June 2022      Published: 04 July 2023
CLC Number (ZTFLH): TP311
Fund: Key R&D Projects in Hubei Province (2021BAA194); National Natural Science Foundation of China (51875416); China Postdoctoral Science Foundation (2020M682492)
Corresponding Author: Xiao Han, ORCID: 0000-0001-8705-7728, E-mail: coolxiaohan@163.com

Cite this article:

Zhang Haibin, Xiao Han, Yi Cancan, Yuan Rui. A Novel Borderline Over-Sampling Method Based on KNN and Deep Gaussian Mixture Model for Imbalanced Data. Data Analysis and Knowledge Discovery, 2023, 7(5): 116-122.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.0609     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2023/V7/I5/116

Flowchart of the Borderline Over-Sampling Method Based on KNN and DGMM
Dataset                    Features  Samples  IR (imbalance ratio)
haberman                   3         306      2.78
vehicle0                   18        846      3.25
ecoli1                     7         336      3.36
vowel0                     13        988      9.98
abalone9-18                8         731      16.4
abalone-19_vs_10-11-12-13  8         1622     49.69
abalone-20_vs_8-9-10       8         1916     72.69
abalone19                  8         4174     129.44
The Information of 8 Public Imbalanced Datasets
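The IR column can be related to class counts with a short sketch. It assumes the usual definition of the imbalance ratio (majority-class count divided by minority-class count), which the table itself does not spell out; the helper names are hypothetical.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def minority_count(n_samples, ir):
    """Invert IR = (n - n_min) / n_min to recover the minority count."""
    return round(n_samples / (1 + ir))


# haberman: 306 samples with IR 2.78, as listed in the table.
n_min = minority_count(306, 2.78)
labels = [1] * n_min + [0] * (306 - n_min)
print(n_min)                              # 81
print(round(imbalance_ratio(labels), 2))  # 2.78
print(minority_count(4174, 129.44))       # 32 (abalone19)
```

Under this definition, abalone19 has only about 32 minority samples out of 4174, which is why a classifier trained on the raw data can ignore the minority class entirely.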
Dataset                    Metric  Normal  SMOTE   BorSMOTE  SafSMOTE  DGMM
haberman                   AUC     0.5854  0.6251  0.6446    0.6265    0.6595
haberman                   G-mean  0.4916  0.6127  0.6365    0.6138    0.6523
vehicle0                   AUC     0.9483  0.9578  0.9610    0.9614    0.9651
vehicle0                   G-mean  0.9475  0.9575  0.9608    0.9611    0.9649
ecoli1                     AUC     0.8472  0.8693  0.8615    0.8730    0.8765
ecoli1                     G-mean  0.8410  0.8672  0.8596    0.8700    0.8752
vowel0                     AUC     0.9790  0.9859  0.9871    0.9854    0.9865
vowel0                     G-mean  0.9785  0.9857  0.9869    0.9852    0.9864
abalone9-18                AUC     0.6373  0.7169  0.6984    0.6889    0.7502
abalone9-18                G-mean  0.5183  0.6918  0.6701    0.6323    0.7371
abalone-19_vs_10-11-12-13  AUC     0.4997  0.6445  0.6544    0.5463    0.7108
abalone-19_vs_10-11-12-13  G-mean  0.0000  0.6016  0.6118    0.2538    0.6913
abalone-20_vs_8-9-10       AUC     0.6149  0.8094  0.7921    0.7070    0.8352
abalone-20_vs_8-9-10       G-mean  0.4317  0.7887  0.7681    0.6223    0.8257
abalone19                  AUC     0.5000  0.6726  0.6685    0.5086    0.7181
abalone19                  G-mean  0.0000  0.6296  0.6250    0.0639    0.7042
Average Value of Evaluation Index Ranking by Different Oversampling Methods
Interval Distribution of AUC Obtained by Different Methods
Interval Distribution of G-Mean Obtained by Different Methods
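The two evaluation metrics in the results table can be sketched as follows. This assumes G-mean is the geometric mean of the true positive rate and the true negative rate, and that AUC is computed from hard class predictions, in which case it reduces to (TPR + TNR) / 2. The source does not state how AUC was obtained, so this is one plausible reading; the function names are hypothetical.

```python
import math


def confusion_rates(y_true, y_pred, positive=1):
    """Return (TPR, TNR); both classes must be present in y_true."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    tpr, tnr = confusion_rates(y_true, y_pred, positive)
    return math.sqrt(tpr * tnr)


def auc_hard(y_true, y_pred, positive=1):
    """Area under a ROC curve with a single operating point."""
    tpr, tnr = confusion_rates(y_true, y_pred, positive)
    return (tpr + tnr) / 2


# 4 minority (positive) and 96 majority (negative) samples.
y_true = [1] * 4 + [0] * 96

# A classifier that always predicts the majority class.
all_majority = [0] * 100
print(g_mean(y_true, all_majority))    # 0.0
print(auc_hard(y_true, all_majority))  # 0.5

# A classifier that finds 3 of 4 positives at the cost of 6 false alarms.
balanced = [1, 1, 1, 0] + [1] * 6 + [0] * 90
print(round(g_mean(y_true, balanced), 4))  # 0.8385
print(auc_hard(y_true, balanced))          # 0.84375
```

Under these assumptions, a G-mean of 0.0000 paired with an AUC near 0.5, as in the Normal column for abalone19, is what a classifier that never predicts the minority class would produce.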