Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (5): 116-122     https://doi.org/10.11925/infotech.2096-3467.2022.0609
Research Paper
A Novel Borderline Over-Sampling Method Based on KNN and Deep Gaussian Mixture Model for Imbalanced Data
Zhang Haibin1,2,Xiao Han1,3(),Yi Cancan1,3,Yuan Rui1,3
1Key Laboratory of Metallurgical Equipment and Control Technology, Ministry of Education, Wuhan University of Science and Technology, Wuhan 430081, China
2Hubei Key Laboratory of Mechanical Transmission and Manufacturing Engineering, Wuhan University of Science and Technology, Wuhan 430081, China
3Precision Manufacturing Institute, Wuhan University of Science and Technology, Wuhan 430081, China

Abstract

[Objective] This paper proposes a borderline over-sampling method based on the k-nearest neighbor (KNN) algorithm and the Deep Gaussian Mixture Model (DGMM) to address classifier bias caused by imbalanced data. [Methods] First, we used the KNN algorithm to identify the borderline minority samples in the training set. Second, we constructed a DGMM for these minority samples and applied the DGMM in reverse to generate over-sampled data that conform to the distribution of the borderline minority samples. Finally, we used the 3σ rule to remove noise samples, repeating the process until no outliers were generated. [Results] The proposed method improved AUC and G-mean by up to 8.62% and 12.99%, with average improvements of 3.51% and 4.93%, respectively. [Limitations] The parameter optimization method for the DGMM needs further improvement. [Conclusions] The proposed method handles imbalanced data better than the compared over-sampling methods.
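The pipeline described in the abstract can be sketched as follows. This is a simplified stand-in, not the authors' implementation: the deep (multi-layer) Gaussian mixture is replaced by a single Gaussian fitted to the borderline samples, and the borderline test follows the Borderline-SMOTE convention (a minority sample is "borderline" when at least half, but not all, of its k nearest neighbours belong to the majority class). All function names and the toy data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def borderline_minority(X, y, k=5, minority=1):
    """Indices of minority samples whose k nearest neighbours are mostly
    majority class (the 'danger' region, as in Borderline-SMOTE)."""
    idx = []
    for i in np.where(y == minority)[0]:
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                       # exclude the sample itself
        nn = np.argsort(d)[:k]
        n_maj = np.sum(y[nn] != minority)
        if k / 2 <= n_maj < k:              # half or more neighbours are majority,
            idx.append(i)                   # but not all (that would be noise)
    return np.array(idx, dtype=int)

def gaussian_oversample(X_src, n_new, rng=rng):
    """Fit one Gaussian to X_src, draw new samples, and discard draws
    outside mean +/- 3*sigma per feature (the 3-sigma rule)."""
    mu = X_src.mean(axis=0)
    sigma = X_src.std(axis=0)
    cov = np.cov(X_src.T) + 1e-6 * np.eye(X_src.shape[1])  # regularize
    out = []
    while len(out) < n_new:
        s = rng.multivariate_normal(mu, cov, size=n_new)
        keep = np.all(np.abs(s - mu) <= 3 * sigma, axis=1)
        out.extend(s[keep])
    return np.asarray(out[:n_new])

# toy imbalanced set: 40 majority points around 0, 8 minority points around 1.5
X = np.vstack([rng.normal(0.0, 1.0, (40, 2)), rng.normal(1.5, 0.5, (8, 2))])
y = np.array([0] * 40 + [1] * 8)

border = borderline_minority(X, y, k=5)
X_src = X[border] if len(border) >= 2 else X[y == 1]  # fall back if few borderline
X_new = gaussian_oversample(X_src, n_new=32)
print(X_new.shape)  # (32, 2)
```

The 3σ filter and re-draw loop mirror the paper's "repeat until no outliers are generated" step; the DGMM's layered latent structure is the part this sketch omits.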

Key words: Imbalanced Data; Over-Sampling; Deep Gaussian Mixture Model
Received: 2022-06-14      Published online: 2023-07-04
ZTFLH:  TP311  
Funding: *This work was supported by the Hubei Provincial Key Research and Development Program of 2021 (2021BAA194), the General Program of the National Natural Science Foundation of China (51875416), and the General Program of the China Postdoctoral Science Foundation (2020M682492).
Corresponding author: Xiao Han, ORCID: 0000-0001-8705-7728, E-mail: coolxiaohan@163.com.
Cite this article:
Zhang Haibin, Xiao Han, Yi Cancan, Yuan Rui. A Novel Borderline Over-Sampling Method Based on KNN and Deep Gaussian Mixture Model for Imbalanced Data. Data Analysis and Knowledge Discovery, 2023, 7(5): 116-122.
Article link:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.0609      或      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2023/V7/I5/116
Fig.1  Flowchart of the borderline over-sampling method based on KNN and DGMM
Dataset                     Features  Samples  IR
haberman                    3         306      2.78
vehicle0                    18        846      3.25
ecoli1                      7         336      3.36
vowel0                      13        988      9.98
abalone9-18                 8         731      16.4
abalone-19_vs_10-11-12-13   8         1622     49.69
abalone-20_vs_8-9-10        8         1916     72.69
abalone19                   8         4174     129.44
Table 1  Information of the eight imbalanced datasets
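IR in Table 1 is the imbalance ratio: the size of the majority class divided by the size of the minority class. A minimal illustration (the 225/81 haberman split used below is inferred from the table's 306 samples and IR of 2.78, not stated in the paper):

```python
from collections import Counter

def imbalance_ratio(labels):
    """IR = size of the largest class / size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# haberman: 225 majority vs 81 minority samples
labels = [0] * 225 + [1] * 81
print(round(imbalance_ratio(labels), 2))  # 2.78
```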
Dataset                     Metric  Normal  SMOTE   BorSMOTE  SafSMOTE  DGMM
haberman                    AUC     0.5854  0.6251  0.6446    0.6265    0.6595
                            G-mean  0.4916  0.6127  0.6365    0.6138    0.6523
vehicle0                    AUC     0.9483  0.9578  0.9610    0.9614    0.9651
                            G-mean  0.9475  0.9575  0.9608    0.9611    0.9649
ecoli1                      AUC     0.8472  0.8693  0.8615    0.8730    0.8765
                            G-mean  0.8410  0.8672  0.8596    0.8700    0.8752
vowel0                      AUC     0.9790  0.9859  0.9871    0.9854    0.9865
                            G-mean  0.9785  0.9857  0.9869    0.9852    0.9864
abalone9-18                 AUC     0.6373  0.7169  0.6984    0.6889    0.7502
                            G-mean  0.5183  0.6918  0.6701    0.6323    0.7371
abalone-19_vs_10-11-12-13   AUC     0.4997  0.6445  0.6544    0.5463    0.7108
                            G-mean  0.0000  0.6016  0.6118    0.2538    0.6913
abalone-20_vs_8-9-10        AUC     0.6149  0.8094  0.7921    0.7070    0.8352
                            G-mean  0.4317  0.7887  0.7681    0.6223    0.8257
abalone19                   AUC     0.5000  0.6726  0.6685    0.5086    0.7181
                            G-mean  0.0000  0.6296  0.6250    0.0639    0.7042
Table 2  Mean evaluation metrics obtained by different over-sampling methods
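G-mean in Table 2 is the geometric mean of sensitivity (minority-class recall) and specificity (majority-class recall). It collapses to zero whenever a classifier ignores the minority class entirely, which is why the unbalanced "Normal" baseline scores 0.0000 on the most imbalanced abalone sets. A minimal sketch, with illustrative toy labels:

```python
import numpy as np

def g_mean(y_true, y_pred, minority=1):
    """Geometric mean of sensitivity and specificity for binary labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == minority) & (y_pred == minority))
    fn = np.sum((y_true == minority) & (y_pred != minority))
    tn = np.sum((y_true != minority) & (y_pred != minority))
    fp = np.sum((y_true != minority) & (y_pred == minority))
    sensitivity = tp / (tp + fn)   # recall on the minority class
    specificity = tn / (tn + fp)   # recall on the majority class
    return np.sqrt(sensitivity * specificity)

# predicting majority everywhere yields G-mean = 0 despite 90% accuracy
y_true = [0] * 9 + [1]
print(g_mean(y_true, [0] * 10))                     # 0.0
print(round(g_mean([0, 0, 1, 1], [0, 1, 1, 1]), 3)) # 0.707
```

AUC, the table's other metric, is likewise threshold-robust under imbalance; both reward balanced performance across classes rather than raw accuracy.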
Fig.2  Distribution intervals of AUC obtained by different methods
Fig.3  Distribution intervals of G-mean obtained by different methods