Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (4): 90-96    DOI: 10.11925/infotech.2096-3467.2018.0533
An Under-sampling Ensemble Classification Algorithm Based on Fuzzy C-Means Clustering for Imbalanced Data
Lianjie Xiao, Mengrui Gao, Xinning Su
School of Information Management, Nanjing University, Nanjing 210023, China
Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
Abstract  

[Objective] This paper addresses the low classification accuracy on the minority class that class imbalance causes in binary classification tasks. [Methods] We propose ECFCM, an under-sampling ensemble classification algorithm for imbalanced data based on fuzzy c-means (FCM) clustering. The majority-class samples are under-sampled by FCM clustering, and the resulting cluster-center samples are combined with all the minority-class samples to form a balanced data set. A Bagging-based ensemble learning algorithm then classifies the balanced data set. [Results] Matlab simulation experiments on four imbalanced data sets show that ECFCM improves Acc, AUC and F1 by up to 5.75%, 13.84% and 7.54%, respectively. [Limitations] The effectiveness of ECFCM was verified only on standard data sets; a specific application would require targeted research on the classification algorithm. [Conclusions] ECFCM performs well to a certain extent and helps improve the binary classification accuracy of the minority class on imbalanced data sets.
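
The sampling and ensemble steps described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Matlab implementation: the number of clusters is set equal to the number of minority samples, the FCM centers themselves stand in for the "cluster-center samples", the base learner is a 1-nearest-neighbour rule, and every function name (`fcm_centers`, `balance_by_fcm`, `bagging_predict`), parameter default and toy data point is hypothetical.

```python
import math
import random
from collections import Counter


def fcm_centers(points, n_clusters, m=2.0, n_iter=50, tol=1e-5, seed=0):
    """Fuzzy c-means: return n_clusters centers for a list of feature vectors."""
    rng = random.Random(seed)
    # Assumption: initialise the centers from randomly chosen data points.
    centers = [list(p) for p in rng.sample(points, n_clusters)]
    dim = len(points[0])
    power = 2.0 / (m - 1.0)
    for _ in range(n_iter):
        weighted = [[0.0] * dim for _ in range(n_clusters)]
        totals = [0.0] * n_clusters
        for p in points:
            # Distances to every center; the floor avoids division by zero.
            dists = [max(math.dist(p, c), 1e-12) for c in centers]
            # Membership is proportional to d^(-2/(m-1)), normalised to sum to 1.
            inv = [d ** -power for d in dists]
            norm = sum(inv)
            for j in range(n_clusters):
                w = (inv[j] / norm) ** m
                totals[j] += w
                for k in range(dim):
                    weighted[j][k] += w * p[k]
        # Each new center is the membership-weighted mean of all points.
        new_centers = [[s / totals[j] for s in weighted[j]]
                       for j in range(n_clusters)]
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers


def balance_by_fcm(majority, minority, seed=0):
    """Replace the majority class by as many FCM centers as minority samples."""
    # Assumption: cluster count equals the minority-class size (label 1).
    centers = fcm_centers(majority, n_clusters=len(minority), seed=seed)
    X = centers + [list(p) for p in minority]
    y = [0] * len(centers) + [1] * len(minority)
    return X, y


def bagging_predict(X, y, query, n_estimators=11, seed=0):
    """Majority vote of 1-NN learners, each fit on a bootstrap sample of (X, y)."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_estimators):
        # Bootstrap: draw len(X) indices with replacement.
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        nearest = min(idx, key=lambda i: math.dist(X[i], query))
        votes[y[nearest]] += 1
    return votes.most_common(1)[0][0]


# Toy imbalanced data (hypothetical): 200 majority vs. 20 minority samples.
data_rng = random.Random(42)
majority = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
minority = [[data_rng.gauss(4, 1), data_rng.gauss(4, 1)] for _ in range(20)]

X_bal, y_bal = balance_by_fcm(majority, minority)
print(len(X_bal), sum(y_bal))  # 40 20
print(bagging_predict(X_bal, y_bal, [4.0, 4.0]))
```

Using the centers directly keeps the sketch short; selecting the real majority sample nearest to each center would be an equally plausible reading of the abstract.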

Key words: Imbalanced Data; Fuzzy C-Means Clustering; Classification; Under-sampling; Ensemble Learning
Received: 11 May 2018      Published: 29 May 2019

Cite this article:

Lianjie Xiao, Mengrui Gao, Xinning Su. An Under-sampling Ensemble Classification Algorithm Based on Fuzzy C-Means Clustering for Imbalanced Data. Data Analysis and Knowledge Discovery, 2019, 3(4): 90-96.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.0533     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I4/90
