An Under-sampling Ensemble Classification Algorithm Based on Fuzzy C-Means Clustering for Imbalanced Data
Lianjie Xiao(),Mengrui Gao,Xinning Su
School of Information Management, Nanjing University, Nanjing 210023, China Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
[Objective] This paper tries to solve the problem of the low accuracy of minority classification in the binary classification task due to class imbalance. [Methods] An under-sampling ensemble classification algorithm based on fuzzy c-means(FCM) clustering for imbalanced data is proposed. That is, the majority class samples are under-sampled based on FCM clustering, all these cluster center samples and all the minority samples are made up to a balance data set. We use the integrated learning algorithm based on Bagging to classify the balanced data sets. [Results] The Matlab simulation results of experiments on four imbalanced datasets show that the ECFCM algorithm improves Acc, AUC and F1 by up to 5.75%, 13.84% and 7.54%. [Limitations] Some standard data sets are used to verify the effectiveness of ECFCM. When in a specific application, a targeted research on classification algorithm is needed. [Conclusions] The ECFCM algorithm performs good to a certain extent, which is conducive to improve the binary classification accuracy of the minority class on imbalanced datasets.
He H, Garcia E A.Learning from Imbalanced Data[J]. IEEE Transactions on Knowledge and Data Engineering, 2009, 21(9): 1263-1284.
[2]
Yang X, Lo D, Huang Q, et al.Automated Identification of High Impact Bug Reports Leveraging Imbalanced Learning Strategies[C]//Proceedings of the 40th IEEE Annual Computer Software and Applications Conference, Atlanta, Georgia,USA. IEEE Press, 2016: 227-232.
[3]
Zakaryazad A, Duman E.A Profit-driven Artificial Neural Network (ANN) with Applications to Fraud Detection and Direct Marketing[J]. Neurocomputing, 2016, 175: 121-131.
[4]
Prusa J D, Khoshgoftaar T M, Seliya N.Enhancing Ensemble Learners with Data Sampling on High-Dimensional Imbalanced Tweet Sentiment Data[C]//Proceedings of the 29th International Florida Artificial Intelligence Research Society Conference(FLAIRS2016), Florida, USA. AAAI Press, 2016: 322-328.
(Fang Lei, Ma Xijun.An Applied Research on Improved Entropy-based SVM Churn Prediction Model[J]. Journal of the China Society for Scientific and Technical Information, 2011, 30(6): 643-648.)
[6]
Galar M, Fernandez A, Barrenechea E, et al.A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches[J]. IEEE Transactions on Systems, Man & Cybernetics, Part C:Applications & Reviews, 2012, 42(4): 463-484.
[7]
Liu G, Yang Y, Li B.Fuzzy Rule-based Oversampling Technique for Imbalanced and Incomplete Data Learning[J]. Knowledge-Based Systems, 2018, 158: 154-174.
[8]
Lin W C, Tsai C F, Hu Y H, et al. Clustering-based Undersampling in Class-imbalanced Data[J]. Information Sciences, 2017, 409-410: 17-26.
[9]
Błaszczyński J, Stefanowski J.Neighbourhood Sampling in Bagging for Imbalanced Data[J]. Neurocomputing, 2015, 150: 529-542.
[10]
Batista G E A P A, Prati R C, Monard M C. A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data[J]. ACM SIGKDD Explorations Newsletter, 2004, 6(1): 20-29.
[11]
Zhang J, Mani I. kNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction [C]// Proceedings of the ICML2003 Workshop on Learning from Imbalanced Datasets, Washington, DC, USA. AAAI Press, 2003: 42-48.
[12]
Cateni S, Colla V, Vannucci M.A Method for Resampling Imbalanced Datasets in Binary Classification Tasks for Real-World Problems[J]. Neurocomputing, 2014, 135: 32-41.
[13]
Ha J, Lee J S.A New Under-Sampling Method Using Genetic Algorithm for Imbalanced Data Classification [C] //Proceedings of the 10th International Conference on Ubiquitous Information Management and Communication, Danang, Vietnam. ACM Press, 2016: Article No.95.
[14]
Kocyigit Y, Seker H.Imbalanced Data Classifier by Using Ensemble Fuzzy C-Means Clustering[C]// Proceedings of the IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI 2012), Hong Kong, China. IEEE Press, 2012: 952-955.
[15]
Dunn J C.A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-separated Clusters[J]. Journal of Cybernetics, 1973, 3(3): 32-57.
[16]
Bezdek J C, Ehrlich R, Full W.FCM: The Fuzzy C-Means Clustering Algorithm[J]. Computers & Geosciences, 1984, 10(2-3): 191-203.
[17]
蔡静颖. 模糊聚类算法及应用[M]. 北京: 冶金工业出版社, 2015.
[17]
(Cai Jingying.Fuzzy Clustering Algorithm and Applications[M]. Beijing: Metallurgical Industry Press, 2015.)
(Zhang Xiang, Zhou Mingquan, Geng Guohua, et al.Application of Bagging Algorithm to Chinese Text Categorization[J]. Computer Engineering and Applications, 2009, 45(5): 135-137, 179.)