Data Analysis and Knowledge Discovery, 2019, 3(4): 90-96
doi: 10.11925/infotech.2096-3467.2018.0533

An Under-sampling Ensemble Classification Algorithm Based on Fuzzy C-Means Clustering for Imbalanced Data

Xiao Lianjie, Gao Mengrui, Su Xinning

Abstract:

[Objective] This paper addresses the low classification accuracy of the minority class in binary classification tasks caused by class imbalance. [Methods] We propose ECFCM, an under-sampling ensemble classification algorithm based on fuzzy C-means (FCM) clustering for imbalanced data: the majority-class samples are under-sampled by FCM clustering, the cluster-center samples are combined with all minority-class samples into a balanced dataset, and a Bagging-based ensemble learning algorithm then classifies the balanced dataset. [Results] Matlab simulations on four imbalanced datasets show that ECFCM improves Acc, AUC and F1 by up to 5.75% (Spambase), 13.84% (Glass2) and 7.54% (Spambase). [Limitations] Standard datasets are used to verify the effectiveness of ECFCM; imbalanced data from real applications will require correspondingly targeted classification algorithms. [Conclusions] ECFCM performs well and helps, to a certain extent, to improve the classification accuracy of the minority class in imbalanced data.

Key words: Imbalanced Data; Fuzzy C-Means Clustering; Classification; Under-sampling; Ensemble Learning

1 Introduction

Classification is one of the core research topics in machine learning. Traditional classification algorithms handle balanced data well and use overall classification accuracy as the evaluation criterion. In practice, however, the numbers of samples in different classes are often imbalanced, as in medical diagnosis[1], software defect prediction[2], credit card fraud detection[3], sentiment analysis[4], and customer churn prediction[5]; in these domains, the classification results for the minority class are usually of greater interest. Minority-class samples are more easily misclassified, and such misclassification strongly affects the usefulness of the classifier. Take a binary classification problem in which the class ratio is 1:99: if overall accuracy is the criterion, the classifier will favor the majority class, and although the computed accuracy reaches 99%, such a result is clearly meaningless. Designing an optimized classification algorithm for imbalanced data is therefore particularly important.

Fuzzy C-Means clustering (FCM) constructs a fuzzy matrix from the attribute features of the objects under study and determines class membership from the computed membership degrees. The FCM algorithm is simple and achieves high clustering accuracy. From the perspective of data preprocessing, this paper applies FCM to under-sample imbalanced data so as to obtain balanced data, and then uses an ensemble learning algorithm to improve the classification accuracy of the minority class. The basic idea is as follows: on the imbalanced dataset, FCM clusters the majority-class samples for under-sampling, and all cluster-center samples are combined with all minority-class samples into a balanced dataset; a Bagging-based ensemble learning algorithm then classifies the balanced dataset, yielding good classification performance.

2 Related Work
2.1 Under-sampling Methods for Imbalanced Data

To address the low classification accuracy of the minority class in imbalanced data, researchers at home and abroad have carried out related work along four main lines: the data level, the algorithm level, the cost-sensitive level, and the ensemble learning level. Galar et al.[6] compared common imbalanced-data classification algorithms and showed that combining data preprocessing with ensemble learning yields the best classification performance. Motivated by this, this paper proposes an under-sampling ensemble classification algorithm based on FCM clustering, which belongs to the data-level family of methods.

Data-level methods keep the spatial distribution of the original dataset unchanged while adding minority-class samples[7] or removing majority-class samples[8], thereby converting the imbalanced data into approximately balanced data. Many experiments have shown that, in data resampling, under-sampling has advantages over over-sampling[9], because over-sampling may cause over-fitting and thus degrade classification. This paper therefore adopts under-sampling for resampling.

Under-sampling aims to reduce the number of majority-class samples so that, after sampling, the majority class is roughly the same size as the minority class, which reduces or removes the adverse effect of class imbalance on traditional classifiers. Among existing under-sampling methods, the simplest is the random under-sampling proposed by Batista et al.[10], which randomly draws part of the majority class and combines it with the whole minority class to form a balanced training set. Its drawback is that the sampling may discard majority-class samples carrying important classification information, distorting the decision boundary. To reduce or avoid this information loss and improve classifier performance, Zhang et al.[11] proposed four KNN-based under-sampling methods: NearMiss-1, NearMiss-2, NearMiss-3 and MostDistant; the NearMiss methods select majority-class samples close to the minority class to train the decision boundary. Cateni et al.[12] proposed a similarity-based under-sampling method that obtains a balanced dataset while reducing the loss of sample information. Ha et al.[13] proposed a genetic-algorithm-based under-sampling method for imbalanced data that takes classification performance as the fitness function of the genetic algorithm; it mitigates the instability of classification performance to some extent while reducing majority-class information loss. Most of these algorithms involve multiple rounds of random under-sampling of the majority class, making the experiments rather complex.

To address these problems, Kocyigit et al.[14] proposed an FCM-based under-sampling method: a semi-supervised fuzzy C-means algorithm clusters the dataset, the majority-class samples closest to the cluster centers are selected and combined with all minority-class samples into a balanced dataset, and SVM, LDA and NB classifiers are then applied to five imbalanced image datasets. The sampling method is simple and achieved good classification results. Its shortcomings are that the three selected strong classifiers are expensive to train, and that the five chosen datasets all have imbalance ratios below 5, so performance on datasets with imbalance ratios of 5 or more was not verified.

This paper fuses FCM-based under-sampling with ensemble learning and proposes an under-sampling Ensemble Classification algorithm based on Fuzzy C-Means clustering for imbalanced data (ECFCM), so as to improve the classification accuracy of the minority class in imbalanced data.

2.2 Fuzzy C-Means Clustering Algorithm

The FCM algorithm originated from optimizing the objective function of "hard" clustering. Drawing on mean-square approximation theory, a constrained nonlinear objective function was constructed, turning clustering into a nonlinear programming problem. The within-groups sum of squared error (WGSS) J1 is therefore commonly used as the clustering objective function. In later work, Dunn[15] extended the WGSS function J1 to J2, a within-groups weighted sum of squared error. Bezdek et al.[16] introduced a new parameter m and generalized J2 to an infinite family of objective functions, forming the FCM algorithm as it is known today.

Let $X=\{{{x}_{1}},{{x}_{2}},\cdots ,{{x}_{n}}\}$ be a dataset of n elements with ${{x}_{i}}\in {{R}^{s}}$. FCM partitions X into c subsets ${{S}_{1}},{{S}_{2}},\cdots ,{{S}_{c}}$. Let $A=\{{{\alpha }_{1}},{{\alpha }_{2}},\cdots ,{{\alpha }_{c}}\}$ denote the centers of the c subsets and ${{\mu }_{ij}}$ the membership degree of element ${{x}_{j}}$ in ${{S}_{i}}$; the objective function optimized by FCM is then given in Equation (1).

$J_{FCM}^{m}(U,A,X)=\sum\limits_{i=1}^{c}{\sum\limits_{j=1}^{n}{\mu _{ij}^{m}}}d_{ij}^{2}=\sum\limits_{i=1}^{c}{\sum\limits_{j=1}^{n}{\mu _{ij}^{m}}}{{\left\| {{x}_{j}}-{{\alpha }_{i}} \right\|}^{2}}$ (1)

where ${{\mu }_{ij}}$ satisfies the constraints given in Equation (2):

$\begin{align} & \sum\limits_{i=1}^{c}{{{\mu }_{ij}}}=1\ \ (1\le j\le n) \\ & {{\mu }_{ij}}\ge 0\ \ (1\le i\le c,1\le j\le n) \\ \end{align}$ (2)

Here, $U=\{{{\mu }_{ij}}\}$ is a $c\times n$ matrix, $A=\{{{\alpha }_{1}},{{\alpha }_{2}},\cdots ,{{\alpha }_{c}}\}$ is an $s\times c$ matrix, and dij is the Euclidean distance between xj and ${{\alpha }_{i}}$. The fuzzy exponent m controls the fuzziness of the partition matrix U: the larger m, the fuzzier the partition. In practice the best range of m is (1.5, 2.5), and m=2 is commonly used[17]. FCM iterates until the objective function reaches its minimum; the update rules that minimize $J_{FCM}^{m}$, derived by the method of Lagrange multipliers, are given in Equations (3) and (4).

${{\alpha }_{i}}=\frac{\sum\limits_{j=1}^{n}{{{({{\mu }_{ij}})}^{m}}{{x}_{j}}}}{\sum\limits_{j=1}^{n}{{{({{\mu }_{ij}})}^{m}}}}$ (3)

${{\mu }_{ij}}=\frac{1}{\sum\limits_{k=1}^{c}{{{\left( \frac{{{d}_{ij}}}{{{d}_{kj}}} \right)}^{\frac{2}{m-1}}}}}$ (4)

FCM thus carries out classification by iteratively updating the cluster-center matrix and the membership matrix. Extensive experiments have shown that, starting from any given initial point, FCM converges along some iteration subsequence to a local minimum or saddle point of the objective function $J_{FCM}^{m}(U,A,X)$.
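
To make the alternating updates concrete, the following is a minimal NumPy sketch of the FCM iteration of Equations (3) and (4); the function name fcm and all parameter defaults are illustrative rather than taken from the paper.

import numpy as np

def fcm(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy C-means. X: (n, s) data; returns centers (c, s) and memberships U (c, n)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)          # enforce the column-sum constraint of Eq. (2)
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)   # Eq. (3): weighted means
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2)  # d_ij, shape (c, n)
        d = np.fmax(d, 1e-12)                  # guard against zero distances
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)         # Eq. (4)
        if np.abs(U_new - U).max() < tol:      # stop once memberships stabilize
            return centers, U_new
        U = U_new
    return centers, U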

3 Ensemble Classification of Imbalanced Data Based on FCM Clustering Under-sampling
3.1 Concept Definitions

Definition 1 (classification data sample): A data collection described jointly by a set of attribute values and a class label is called a classification data sample, written $x=({{x}_{1}},{{x}_{2}},\cdots ,{{x}_{m}},y)$, where $({{x}_{1}},{{x}_{2}},\cdots ,{{x}_{m}})$ is the attribute vector of the sample, m is the number of attributes, and y is the class label. Since this paper only considers binary classification, y takes the value 0 or 1, and minority-class samples (y = 0) are taken to be the positive class.

Definition 2 (balanced dataset, imbalanced dataset): Given a dataset $d=\{{{x}^{i}},{{y}^{i}}\}_{i=1}^{n}$, let ${{d}_{1}}=\{{{x}^{i}},0\}_{i=1}^{{{n}_{1}}}$ be the subset with class label 0 and ${{d}_{2}}=\{{{x}^{i}},1\}_{i=1}^{{{n}_{2}}}$ the subset with class label 1, where n1 and n2 are the sample counts of d1 and d2 and n1+n2=n. The dataset d is balanced if and only if ${{n}_{1}}/n\approx {{n}_{2}}/n$; otherwise it is imbalanced.
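
Later sections quote an imbalance ratio for each benchmark dataset. Assuming the conventional definition (the paper does not state one explicitly), with minority class of size ${{n}_{1}}$ and majority class of size ${{n}_{2}}$:

$IR={{n}_{2}}/{{n}_{1}}$

so, for example, a dataset with 100 majority-class samples and 20 minority-class samples has IR = 5.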

3.2 Method Definition

(1) FCM Clustering-based Under-sampling Process

The ECFCM algorithm has two main stages. First, FCM clusters the majority-class samples for under-sampling: the number of clusters k is set to the number of minority-class samples (i.e., $k={{n}_{1}}$), and the k cluster-center samples replace the original majority-class samples, so the resulting balanced dataset consists of two classes of roughly equal size. Second, a classifier is trained on the balanced dataset and several rounds of classification experiments are run.

Compared with random under-sampling, FCM clustering-based under-sampling reduces the number of majority-class samples while losing as little majority-class information as possible. The under-sampling process is described as follows:

Input: imbalanced dataset $d$

Output: balanced dataset ${d}'$

Initialization: ${d}'=\varnothing $, ${{d}_{2}}=\varnothing $, $k=0$

For ${{x}^{i}}\in d$

If ${{y}^{i}}=0$

${d}'={d}'\bigcup \{{{x}^{i}}\}$

Else

${{d}_{2}}={{d}_{2}}\bigcup \{{{x}^{i}}\}$

EndIf

EndFor

$k=card({d}')$

${{{d}'}_{2}}=fcm\text{-}Cluster(k,{{d}_{2}})$

${d}'={d}'\bigcup {{{d}'}_{2}}$

The pseudocode above describes the FCM clustering-based under-sampling process. The sample set is first split into a minority-class set and a majority-class set; the majority class is then clustered, and the resulting cluster centers are combined with all minority-class samples into a balanced dataset. To guarantee that the majority and minority classes are balanced after under-sampling, the number of clusters for the majority class is set to the number of minority-class samples.
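
As an illustration, here is a minimal Python sketch of this balancing step; it reuses the fcm function sketched in Section 2.2, follows the label convention of Definition 1 (minority class labeled 0), and is an assumption rather than the authors' Matlab code.

import numpy as np
# assumes the fcm() sketch from Section 2.2 is available in scope

def fcm_undersample(X, y, m=2.0):
    """All minority samples plus k = n1 FCM cluster centers of the majority class."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    X_min, X_maj = X[y == 0], X[y == 1]   # minority labeled 0 per Definition 1
    k = len(X_min)                        # number of clusters = minority-class size
    centers, _ = fcm(X_maj, c=k, m=m)     # cluster the majority class (Eqs. (3)-(4))
    X_bal = np.vstack([X_min, centers])   # cluster centers replace the majority samples
    y_bal = np.concatenate([np.zeros(k, dtype=int), np.ones(k, dtype=int)])
    return X_bal, y_bal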

Although FCM clustering reduces the number of majority-class samples, the spatial distribution of the data is unchanged; the distributions before and after clustering are shown in Figure 1 and Figure 2. FCM clustering-based under-sampling thus uses clustering to turn imbalanced classes into balanced ones while preserving the spatial distribution of the data. The ensemble classification stage then learns rules on the balanced data with an ensemble learning algorithm and outputs the classification model.

Figure 1 Feature distribution of the data before clustering

Figure 2 Feature distribution of the data after clustering

(2) Ensemble Learning Training Process

Ensemble learning comes in two main flavors: methods whose base classifiers have no strong dependence on one another and can be generated in parallel, typified by Bagging, and methods whose base classifiers are strongly dependent and must be generated sequentially, typified by Boosting[18]. Because Boosting can over-fit on some practical problems and may even produce worse results than a single classifier[19], while Bagging largely avoids this problem, this paper adopts Bagging as the ensemble learning scheme.

Bagging works as follows: given a base classifier and T training sets, each training set consisting of n ($n\approx 63.2\%\times N$) samples drawn at random from the initial dataset of N samples, the base classifier is trained for T rounds to obtain T prediction functions ${{F}_{t}}$; the T prediction functions each predict on the test set, and the final prediction is obtained by majority voting. The Bagging procedure is shown in Figure 3.

Figure 3 Implementation steps of Bagging
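
A minimal sketch of this train-and-vote loop follows, assuming scikit-learn base estimators; the helper name and the default T are illustrative.

import numpy as np
from sklearn.base import clone
from sklearn.neighbors import KNeighborsClassifier

def bagging_fit_predict(base, X_train, y_train, X_test, T=25, seed=0):
    """Train T bootstrap replicas of `base` and combine them by majority vote."""
    rng = np.random.default_rng(seed)
    N = len(X_train)
    votes = np.zeros((T, len(X_test)), dtype=int)
    for t in range(T):
        idx = rng.choice(N, size=N, replace=True)  # bootstrap: ~63.2% distinct samples
        F_t = clone(base).fit(X_train[idx], y_train[idx])
        votes[t] = F_t.predict(X_test)
    return (votes.sum(axis=0) * 2 >= T).astype(int)  # majority vote over the F_t

# e.g., KNN as the base classifier, as chosen in Section 3.2:
# y_pred = bagging_fit_predict(KNeighborsClassifier(), X_bal, y_bal, X_test)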

(3) Ensemble Learning Testing Process

Suppose the base classifier has computational complexity O(m) and sampling plus voting/averaging has complexity O(s); the complexity of Bagging is then about T(O(m)+O(s)). Since O(s) is generally small and T is a small constant, Bagging is an efficient ensemble learning framework. Moreover, Bagging rests on bootstrap sampling, whose advantage is that each base classifier uses only about 63.2% of the samples in the initial training set, so the remaining 36.8% can serve as a validation set for out-of-bag estimation of generalization performance.
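
The 63.2% figure follows from the bootstrap itself: when N samples are drawn with replacement, the probability that any given sample is never drawn is

${{(1-1/N)}^{N}}\to {{e}^{-1}}\approx 0.368\ \ (N\to \infty )$

so a bootstrap sample contains, on average, about $1-{{e}^{-1}}\approx 63.2\%$ of the distinct original samples, leaving roughly 36.8% out-of-bag.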

When the base classifier is a decision tree, the out-of-bag samples can be used to estimate the posterior probabilities at the tree's nodes and to assist in handling nodes with no training samples. Among classical classification algorithms, KNN has low time complexity and is often used in Bagging ensembles[20]. For these reasons, this paper chooses Decision-Stump and KNN as the base classifiers in the ensemble learning algorithm.

To keep the validation of the ECFCM algorithm objective, 10-fold cross-validation is used: the data are randomly split into 10 equal parts, 9 parts are used in turn as the training set and 1 part as the test set, and the mean of the corresponding 10 experiment results is taken as the final result of one round of 10-fold cross-validation.
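
A minimal sketch of this protocol with scikit-learn (stratified folds are an assumption; the paper specifies only a random 10-fold split):

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

def ten_fold_score(fit_predict, X, y, seed=0):
    """Mean test accuracy over one round of 10-fold cross-validation (X, y: NumPy arrays)."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    scores = []
    for tr, te in skf.split(X, y):
        y_pred = fit_predict(X[tr], y[tr], X[te])  # e.g., undersample then bag (ECFCM)
        scores.append(accuracy_score(y[te], y_pred))
    return float(np.mean(scores))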

4 Experiments and Analysis
4.1 Datasets

To test the effectiveness of ECFCM on data with a small imbalance ratio (below 5) and a large imbalance ratio (above 5), four imbalanced datasets were selected from UCI (http://archive.ics.uci.edu/ml/datasets.html) and KEEL (http://www.keel.es/) for Matlab simulation. The properties of the four datasets are listed in Table 1.

Table 1 The four imbalanced datasets

4.2 Evaluation Metrics

Denote the minority-class samples in the imbalanced dataset as P and the majority-class samples as N; the confusion matrix is constructed as in Table 2.

Table 2 Confusion matrix for binary classification

TN denotes samples that belong to the majority class and are also predicted as majority class. From the confusion matrix, a series of evaluation metrics are defined: overall accuracy (Acc), true positive rate (TPR), recall (Recall), minority-class accuracy (MIA), precision (Precision), F-measure, AUC, and so on.

The overall accuracy of the classifier is computed as in Equation (5).

$Acc=\frac{TP+TN}{TP+TN+FP+FN}$ (5)

The recall of the classifier, also called the minority-class accuracy or positive accuracy, is computed as in Equation (6).

$TPR=Recall=MIA=\frac{TP}{TP+FN}$ (6)

The precision of the classifier, i.e., the precision on the minority class, is computed as in Equation (7).

$Precision=\frac{TP}{TP+FP}$ (7)

F-measure is another evaluation criterion oriented to imbalanced-data classification, as given in Equation (8).

$F\mathrm{-}measure=\frac{(1+{{b}^{2}})\times Recall\times Precision}{{{b}^{2}}\times Recall+Precision}$ (8)

b is a coefficient that balances Precision and Recall; it is usually set to 1, giving the F1 value. F1 is commonly used to evaluate imbalanced-classification performance and, compared with G-mean, places more emphasis on the classification performance of the minority class.

The ROC curve is a two-dimensional plot with the false positive rate (FPR) on the x-axis and the true positive rate (TPR) on the y-axis. The area under the ROC curve is denoted AUC; since the AUC value is widely used to evaluate classifier performance, it is adopted here as one of the evaluation metrics.
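
For reference, the metrics of Equations (5)-(8) and AUC can be computed with scikit-learn, treating the minority class (label 0 in this paper) as the positive class; a minimal sketch:

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

def imbalance_metrics(y_true, y_pred, y_score):
    """Acc, minority-class Precision/Recall/F1, and AUC.
    y_score: predicted probability of class 1 (AUC is invariant to the choice of positive class)."""
    return {
        "Acc": accuracy_score(y_true, y_pred),                      # Eq. (5)
        "Recall": recall_score(y_true, y_pred, pos_label=0),        # Eq. (6): TPR / MIA
        "Precision": precision_score(y_true, y_pred, pos_label=0),  # Eq. (7)
        "F1": f1_score(y_true, y_pred, pos_label=0),                # Eq. (8) with b = 1
        "AUC": roc_auc_score(y_true, y_score),
    }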

4.3 Experimental Platform

The experiments were run under 64-bit Mac OS Sierra with 8GB of RAM and an Intel Core i5 processor, using Matlab R2017a for the simulations; data preprocessing used the Classification Learner app.

4.4 Results and Analysis

The simulations compare the classification performance of the Decision-Stump algorithm, the KNN algorithm, Bagging with Decision-Stump, and Bagging with KNN; the first two are traditional classification algorithms and the latter two are ensemble algorithms built on base classifiers. The results are shown in Tables 3-6, where bold numbers mark the best result for each metric.

Table 3 Comparison of metrics on the Spambase dataset

Table 4 Comparison of metrics on the Ionosphere dataset

Table 5 Comparison of metrics on the Segment0 dataset

Table 6 Comparison of metrics on the Glass2 dataset

Based on Tables 3-6, the average improvements of ECFCM in Acc, AUC and F1 are computed on the balanced and the imbalanced data respectively; each change is expressed as a percentage (+ for improvement, - for decline), as shown in Table 7.

Table 7 Classification performance improvement of ECFCM (%)

With ECFCM, the average improvements in Acc, AUC and F1 reach up to 5.75% (Spambase), 13.84% (Glass2) and 7.54% (Spambase). On the Segment0 and Glass2 datasets the Acc value drops, by 0.26% and 9.84% respectively. Analysis shows that this is because the two datasets have imbalance ratios of 6.02 and 11.59: the larger the imbalance ratio, the more informative samples are lost when under-sampling the majority class; for example, losing samples near the decision boundary affects the classification results and lowers classification accuracy.

Relative to the four imbalanced datasets, ECFCM improves AUC and F1 on the four balanced datasets by 5.88% and 2.73% on average. On the balanced datasets, Bagging (Decision-Stump) outperforms the other three algorithms on the three metrics, indicating that Bagging (Decision-Stump) is better suited to classifying minority-class samples; at the same time, the F1 values of Decision-Stump and Bagging (Decision-Stump) both exceed their respective F1 values on the imbalanced datasets, showing that ECFCM raises both the recall and the precision of the minority class, i.e., it classifies minority-class samples well.

In summary, ECFCM effectively improves the overall performance of imbalanced-data classification; the effect on minority-class performance is especially pronounced when the imbalance ratio is below 5, demonstrating the feasibility and effectiveness of ECFCM for imbalanced-data classification.

5 Conclusion

To address the low classification accuracy of the minority class caused by class imbalance in binary classification tasks, this paper combines the fuzzy C-means clustering algorithm with ensemble learning and proposes ECFCM, an under-sampling ensemble classification algorithm based on FCM clustering. Exploiting the strengths of the Bagging ensemble learning algorithm for imbalanced-data classification, Matlab simulations were run on four standard imbalanced datasets from UCI and KEEL. The results show that on datasets with imbalance ratios below 5 (Spambase and Ionosphere) the Acc value of ECFCM improves, while on datasets with imbalance ratios above 5 (Segment0 and Glass2) it declines, probably because FCM-based under-sampling of the majority class discards many informative samples that carry important classification information. The AUC and F1 values of ECFCM on the four datasets improve markedly over the existing algorithms, demonstrating that ECFCM can raise the classification accuracy of minority-class samples in imbalanced data. This paper validates the classification performance of ECFCM only on standard datasets; real-world data are more complex, e.g., extremely imbalanced data and multi-class problems, and how to handle such cases will be the focus of future work.

Author Contributions

Xiao Lianjie: proposed the research idea, designed the research plan, drafted and revised the paper;

Xiao Lianjie, Gao Mengrui: data collection and simulation experiments;

Su Xinning: paper revision.

Conflict of Interest Statement

All authors declare no conflict of interest.

Supporting Data

The supporting data are self-archived by the authors, E-mail: 1061939301@qq.com.

[1] Xiao Lianjie. 数据集.rar. The UCI and KEEL datasets.

[2] Xiao Lianjie. 算法.rar. The ECFCM algorithm.

References

[1] He H, Garcia E A. Learning from Imbalanced Data[J]. IEEE Transactions on Knowledge and Data Engineering, 2009, 21(9): 1263-1284. DOI: 10.1109/TKDE.2008.239.
[2] Yang X, Lo D, Huang Q, et al. Automated Identification of High Impact Bug Reports Leveraging Imbalanced Learning Strategies[C]// Proceedings of the 40th IEEE Annual Computer Software and Applications Conference, Atlanta, Georgia, USA. IEEE Press, 2016: 227-232.
[3] Zakaryazad A, Duman E. A Profit-driven Artificial Neural Network (ANN) with Applications to Fraud Detection and Direct Marketing[J]. Neurocomputing, 2016, 175: 121-131. DOI: 10.1016/j.neucom.2015.10.042.
[4] Prusa J D, Khoshgoftaar T M, Seliya N. Enhancing Ensemble Learners with Data Sampling on High-Dimensional Imbalanced Tweet Sentiment Data[C]// Proceedings of the 29th International Florida Artificial Intelligence Research Society Conference (FLAIRS 2016), Florida, USA. AAAI Press, 2016: 322-328.
[5] Fang Lei, Ma Xijun. An Applied Research on Improved Entropy-based SVM Churn Prediction Model[J]. Journal of the China Society for Scientific and Technical Information, 2011, 30(6): 643-648. (in Chinese) DOI: 10.3772/j.issn.1000-0135.2011.06.012.
[6] Galar M, Fernandez A, Barrenechea E, et al. A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches[J]. IEEE Transactions on Systems, Man & Cybernetics, Part C: Applications & Reviews, 2012, 42(4): 463-484. DOI: 10.1109/TSMCC.2011.2161285.
[7] Liu G, Yang Y, Li B. Fuzzy Rule-based Oversampling Technique for Imbalanced and Incomplete Data Learning[J]. Knowledge-Based Systems, 2018, 158: 154-174. DOI: 10.1016/j.knosys.2018.05.044.
[8] Lin W C, Tsai C F, Hu Y H, et al. Clustering-based Undersampling in Class-imbalanced Data[J]. Information Sciences, 2017, 409-410: 17-26. DOI: 10.1016/j.ins.2017.05.008.
[9] Błaszczyński J, Stefanowski J. Neighbourhood Sampling in Bagging for Imbalanced Data[J]. Neurocomputing, 2015, 150: 529-542. DOI: 10.1016/j.neucom.2014.07.064.
[10] Batista G E A P A, Prati R C, Monard M C. A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data[J]. ACM SIGKDD Explorations Newsletter, 2004, 6(1): 20-29. DOI: 10.1145/1007730.
[11] Zhang J, Mani I. kNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction[C]// Proceedings of the ICML 2003 Workshop on Learning from Imbalanced Datasets, Washington, DC, USA. AAAI Press, 2003: 42-48.
[12] Cateni S, Colla V, Vannucci M. A Method for Resampling Imbalanced Datasets in Binary Classification Tasks for Real-World Problems[J]. Neurocomputing, 2014, 135: 32-41. DOI: 10.1016/j.neucom.2013.05.059.
[13] Ha J, Lee J S. A New Under-Sampling Method Using Genetic Algorithm for Imbalanced Data Classification[C]// Proceedings of the 10th International Conference on Ubiquitous Information Management and Communication, Danang, Vietnam. ACM Press, 2016: Article No. 95.
[14] Kocyigit Y, Seker H. Imbalanced Data Classifier by Using Ensemble Fuzzy C-Means Clustering[C]// Proceedings of the IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI 2012), Hong Kong, China. IEEE Press, 2012: 952-955.
[15] Dunn J C. A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-separated Clusters[J]. Journal of Cybernetics, 1973, 3(3): 32-57. DOI: 10.1080/01969727308546046.
[16] Bezdek J C, Ehrlich R, Full W. FCM: The Fuzzy C-Means Clustering Algorithm[J]. Computers & Geosciences, 1984, 10(2-3): 191-203. DOI: 10.1016/0098-3004(84)90020-7.
[17] Cai Jingying. Fuzzy Clustering Algorithm and Applications[M]. Beijing: Metallurgical Industry Press, 2015. (in Chinese)
[18] Zhang Xiang, Zhou Mingquan, Geng Guohua, et al. Application of Bagging Algorithm to Chinese Text Categorization[J]. Computer Engineering and Applications, 2009, 45(5): 135-137, 179. (in Chinese) DOI: 10.3778/j.issn.1002-8331.2009.05.039.
[19] Shen Xuehua, Zhou Zhihua, Wu Jianxin, et al. Survey of Boosting and Bagging[J]. Computer Engineering and Applications, 2000, 36(12): 31-32, 40. (in Chinese)
[20] Mao Guojun, Duan Lijuan. The Principle and Algorithm of Data Mining[M]. 3rd Edition. Beijing: Tsinghua University Press, 2016. (in Chinese)