Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (4): 90-96    DOI: 10.11925/infotech.2096-3467.2018.0533
An Under-sampling Ensemble Classification Algorithm Based on Fuzzy C-Means Clustering for Imbalanced Data
Lianjie Xiao, Mengrui Gao, Xinning Su
School of Information Management, Nanjing University, Nanjing 210023, China
Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
Abstract  

[Objective] This paper addresses the low classification accuracy of the minority class in binary classification tasks caused by class imbalance. [Methods] We propose ECFCM, an under-sampling ensemble classification algorithm based on fuzzy c-means (FCM) clustering for imbalanced data. The majority-class samples are under-sampled by FCM clustering, and the resulting cluster-center samples are combined with all the minority-class samples to form a balanced data set. A Bagging-based ensemble learning algorithm is then used to classify the balanced data sets. [Results] Matlab simulation experiments on four imbalanced datasets show that the ECFCM algorithm improves Acc, AUC and F1 by up to 5.75%, 13.84% and 7.54%, respectively. [Limitations] Only standard data sets were used to verify the effectiveness of ECFCM; a specific application would require targeted research on the classification algorithm. [Conclusions] The ECFCM algorithm performs well to a certain extent and helps improve the binary classification accuracy of the minority class on imbalanced datasets.
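
The pipeline described in the abstract (FCM under-sampling of the majority class, then Bagging on the balanced set) can be illustrated with a minimal Python sketch. This is not the authors' Matlab implementation. The function names, the choice of one cluster per minority sample, the fuzzifier m = 2, the nearest-centroid base learner, and the number of estimators are all assumptions made for illustration only; the abstract does not specify them.

```python
import numpy as np

def fcm_centers(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Plain fuzzy c-means; returns the c cluster centers of the rows of X."""
    rng = np.random.default_rng(seed)
    # Random initial membership matrix U (n x c); each row sums to 1.
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Centers are membership-weighted means of the samples.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Distance of every sample to every center (n x c), kept above zero.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)
        # Standard FCM membership update: u_ij proportional to d_ij^(-2/(m-1)).
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers

def balance_with_fcm(X, y, majority_label, seed=0):
    """Replace the majority class by FCM cluster centers (assumed: one per minority sample)."""
    is_maj = y == majority_label
    X_min, y_min = X[~is_maj], y[~is_maj]
    centers = fcm_centers(X[is_maj], c=len(X_min), seed=seed)
    X_bal = np.vstack([centers, X_min])
    y_bal = np.concatenate([np.full(len(centers), majority_label), y_min])
    return X_bal, y_bal

def bagging_predict(X_train, y_train, X_test, n_estimators=11, seed=0):
    """Bagging with a nearest-centroid base learner (a stand-in, not the paper's classifier)."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y_train)
    votes = np.zeros((len(X_test), len(classes)), dtype=int)
    rows = np.arange(len(X_test))
    for _ in range(n_estimators):
        # Bootstrap sample: draw with replacement, same size as the training set.
        idx = rng.integers(0, len(X_train), len(X_train))
        Xb, yb = X_train[idx], y_train[idx]
        present = np.array([k for k in classes if np.any(yb == k)])
        cents = np.array([Xb[yb == k].mean(axis=0) for k in present])
        d = np.linalg.norm(X_test[:, None, :] - cents[None, :, :], axis=2)
        pred = present[d.argmin(axis=1)]
        votes[rows, np.searchsorted(classes, pred)] += 1
    # Majority vote over all base learners.
    return classes[votes.argmax(axis=1)]
```

Calling `balance_with_fcm(X, y, majority_label=0)` on a feature matrix `X` and label vector `y` returns a data set with as many cluster-center samples as minority samples, which can then be passed to `bagging_predict`. Any other base classifier could be substituted for the nearest-centroid rule used here.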

Key words: Imbalanced Data; Fuzzy C-Means Clustering; Classification; Under-sampling; Ensemble Learning
Received: 11 May 2018      Published: 29 May 2019

Cite this article:

Lianjie Xiao, Mengrui Gao, Xinning Su. An Under-sampling Ensemble Classification Algorithm Based on Fuzzy C-Means Clustering for Imbalanced Data. Data Analysis and Knowledge Discovery, 2019, 3(4): 90-96.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.0533     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I4/90
