Data Analysis and Knowledge Discovery (数据分析与知识发现)
A novel borderline over-sampling method based on KNN and Deep Gaussian Mixture Model for Imbalanced Data
ZHANG Haibin, XIAO Han, YI Cancan, YUAN Rui
(Key Laboratory of Metallurgical Equipment and Control Technology, Ministry of Education, Wuhan University of Science and Technology, Wuhan 430081, China) (Hubei Key Laboratory of Mechanical Transmission and Manufacturing Engineering, Wuhan University of Science and Technology, Wuhan 430081, China) (Precision Manufacturing Institute, Wuhan University of Science and Technology, Wuhan 430081, China)
Abstract

[Objective] To address the classifier bias caused by data imbalance, this paper proposes a borderline oversampling method based on K-nearest neighbors (KNN) and the Deep Gaussian Mixture Model (DGMM). [Methods] First, the KNN algorithm is used to identify the borderline minority-class samples in the training set. Second, DGMMs are built for the minority-class samples in that region and then applied in reverse to generate oversampled data that follow the distribution of the borderline minority-class samples. Finally, the three-sigma (3σ) rule is used to remove noisy samples from the generated data, and this step is repeated until the generated samples contain no outliers. [Results] The proposed method improves AUC and G-mean by up to 5.64% and 7.95%, respectively, with average improvements of 2.75% and 3.78%. [Limitations] The parameter optimization method for the DGMM needs further refinement. [Conclusions] The proposed method can better handle the data imbalance problem.
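
The three steps in [Methods] can be illustrated with a minimal sketch. It is not the authors' implementation: a plain `GaussianMixture` from scikit-learn stands in for the DGMM, the borderline rule (at least half, but not all, of the k neighbors belong to the majority class) is borrowed from Borderline-SMOTE as an assumption, and the function names, `k`, `n_components`, and the toy data are all hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestNeighbors


def borderline_minority(X, y, minority_label=1, k=5):
    """Minority samples whose k neighbours are mostly, but not all, majority.

    Assumed rule (Borderline-SMOTE style); the paper's exact criterion may differ.
    """
    X_min = X[y == minority_label]
    # Ask for k + 1 neighbours: each query point is returned as its own
    # nearest neighbour (assuming no duplicate points), so column 0 is dropped.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X_min)
    n_majority = (y[idx[:, 1:]] != minority_label).sum(axis=1)
    mask = (n_majority >= k / 2) & (n_majority < k)
    return X_min[mask]


def three_sigma_filter(samples):
    """Drop rows with any feature beyond 3 std of the mean; repeat until none remain."""
    while True:
        mu, sigma = samples.mean(axis=0), samples.std(axis=0)
        keep = (np.abs(samples - mu) <= 3 * sigma).all(axis=1)
        if keep.all():
            return samples
        samples = samples[keep]


def oversample(X, y, n_new, minority_label=1, k=5, n_components=2, seed=0):
    """Fit a mixture to the borderline minority samples and draw filtered samples."""
    border = borderline_minority(X, y, minority_label, k)
    if len(border) < n_components:
        raise ValueError("too few borderline samples to fit the mixture")
    # Stand-in for the DGMM: an ordinary (shallow) Gaussian mixture.
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(border)
    synthetic, _ = gmm.sample(n_new)
    # Filtering may return fewer than n_new rows.
    return three_sigma_filter(synthetic)


# Toy imbalanced data: 300 majority points and 60 overlapping minority points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (300, 2)), rng.normal(1.5, 1.0, (60, 2))])
y = np.array([0] * 300 + [1] * 60)
synthetic = oversample(X, y, n_new=100)
print(synthetic.shape)
```

In this sketch the 3σ statistics are recomputed on the generated samples after every pass, which is one reading of "repeated until the generated samples contain no outliers"; computing them on the original borderline samples would be an equally plausible variant.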

Key words: imbalanced data; over-sampling; Deep Gaussian Mixture Model
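
The [Results] in the abstract are reported as AUC and G-mean. The sketch below shows how these two metrics are commonly computed, assuming the usual definition of G-mean as the geometric mean of sensitivity and specificity; the abstract does not state the exact definition used, and the labels and scores here are made-up toy values.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity (minority recall) and specificity (majority recall)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return float(np.sqrt(sensitivity * specificity))


# Toy example: 6 majority (label 0) and 4 minority (label 1) samples.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.2, 0.7, 0.8, 0.4, 0.9])
y_pred = (scores >= 0.5).astype(int)  # hard labels at a 0.5 threshold

auc = roc_auc_score(y_true, scores)  # threshold-free ranking quality
gm = g_mean(y_true, y_pred)          # balance between the two class recalls
print(f"AUC = {auc:.4f}, G-mean = {gm:.4f}")
```

Unlike plain accuracy, both metrics penalize a classifier that ignores the minority class, which is why they are a common choice for imbalanced data.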
Publication date: 2022-11-10
CLC number (ZTFLH): TP181, TP311.13
Cite this article:
ZHANG Haibin, XIAO Han, YI Cancan, YUAN Rui. A novel borderline over-sampling method based on KNN and Deep Gaussian Mixture Model for Imbalanced Data [J]. Data Analysis and Knowledge Discovery, 2023, 7(5): 116-122. DOI: 10.11925/infotech.2096-3467.2022-0609.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022-0609 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y0/V/I/1