Under-sampling Algorithm with Weighted Distance Based on Adaptive K-Means Clustering
Zhou Qian1, Yao Zhen2, Sun Bo1
1 College of Information Science and Engineering, Shandong Agricultural University, Taian 271018, China
2 Library of Shandong Agricultural University, Taian 271018, China
[Objective] This study aims to reduce the impact of imbalanced data on classification accuracy. [Methods] First, we used an adaptive k-means clustering algorithm to cluster the majority class and remove outliers. Then, we calculated the weighted distance between the data points and the cluster centers and sorted the points by this distance. Next, we sequentially sampled the majority class according to the density of each cluster. Finally, we trained the classification algorithm on the combination of the sampled majority data and the minority class. [Results] On 25 imbalanced datasets, the average maximum AUC reached 0.912, at least 0.014 higher than that of the other methods compared. The average running time of the new algorithm was 1.377 s, and it also worked well on large imbalanced data sets. [Limitations] The proposed model cannot handle multi-class classification problems. [Conclusions] The new algorithm can identify the optimal k value, detect and remove outliers, mitigate the class imbalance problem, and improve classification accuracy. It processes large imbalanced data sets quickly and cost-effectively.
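The four steps in [Methods] can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how k is adapted, how the distance is weighted, how outliers are detected, or how cluster density is measured. The sketch therefore assumes a fixed k supplied by the caller in place of the adaptive choice, per-feature weights `w` for the distance, a mean-plus-`z`-standard-deviations outlier rule, and each cluster's share of the retained points as a stand-in for density. All function and parameter names are hypothetical.

```python
import math
import random

def wdist(a, b, w):
    """Weighted Euclidean distance; the per-feature weights w are an assumption."""
    return math.sqrt(sum(wi * (x - y) ** 2 for x, y, wi in zip(a, b, w)))

def nearest(p, centers, w):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda c: wdist(p, centers[c], w))

def kmeans(points, k, w, iters=50, seed=0):
    """Plain Lloyd's k-means with a fixed k (the paper adapts k; this does not)."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        labels = [nearest(p, centers, w) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            # An empty cluster keeps its previous center.
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else centers[c])
        if updated == centers:
            break
        centers = updated
    return centers, [nearest(p, centers, w) for p in points]

def undersample(majority, n_keep, k=2, w=None, z=2.0, seed=0):
    """Cluster the majority class, drop outliers, then sample each cluster."""
    w = w or [1.0] * len(majority[0])
    centers, labels = kmeans(majority, k, w, seed=seed)
    clusters = []
    for c in range(k):
        # Sort each cluster's members by weighted distance to its center.
        ranked = sorted((wdist(p, centers[c], w), p)
                        for p, l in zip(majority, labels) if l == c)
        if not ranked:
            continue
        mean = sum(d for d, _ in ranked) / len(ranked)
        std = math.sqrt(sum((d - mean) ** 2 for d, _ in ranked) / len(ranked))
        # Assumed outlier rule: drop points farther than mean + z * std.
        clusters.append([p for d, p in ranked if d <= mean + z * std])
    total = sum(len(members) for members in clusters)
    sample = []
    for members in clusters:
        # Assumed density proxy: the cluster's share of the retained points.
        quota = min(len(members), max(1, round(n_keep * len(members) / total)))
        # Sequential (systematic) pick along the distance-sorted list.
        sample.extend(members[i * len(members) // quota] for i in range(quota))
    return sample

# Toy data: two majority-class blobs and a small minority class.
majority = [(i % 5 * 0.1, i // 5 * 0.1) for i in range(25)]
majority += [(10 + x, 10 + y) for x, y in majority]
minority = [(5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]

sample = undersample(majority, n_keep=10, k=2)
# Final step: combine the sampled majority points with the minority class.
training_set = [(p, 0) for p in sample] + [(p, 1) for p in minority]
print(len(sample), len(training_set))
```

Because of rounding and the one-point minimum per cluster, the number of sampled points can differ slightly from `n_keep`. The labelled `training_set` can then be passed to any binary classifier.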
Zhou Qian, Yao Zhen, Sun Bo. Under-sampling Algorithm with Weighted Distance Based on Adaptive K-Means Clustering[J]. Data Analysis and Knowledge Discovery, 2022, 6(5): 127-136.