Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (5): 127-136    DOI: 10.11925/infotech.2096-3467.2021.0847
Under-sampling Algorithm with Weighted Distance Based on Adaptive K-Means Clustering
Zhou Qian1, Yao Zhen2, Sun Bo1
1College of Information Science and Engineering, Shandong Agricultural University, Taian 271018, China
2Library of Shandong Agricultural University, Taian 271018, China
Abstract  

[Objective] This study tries to reduce the impact of imbalanced data on classification accuracy. [Methods] First, we used an adaptive k-means clustering algorithm to partition the majority class and remove outliers. Then, we calculated the weighted distance between each sample and its cluster center and sorted the samples by this distance. Next, we under-sampled the majority class cluster by cluster, in proportion to cluster density. Finally, we trained classification algorithms on the sampled majority data combined with the minority class. [Results] Over 25 imbalanced datasets, the average of the maximum AUC values reached 0.912, at least 0.014 higher than the competing methods. The algorithm's average running time was 1.377 s, and it also worked well on large imbalanced datasets. [Limitations] The proposed model cannot address multi-class problems. [Conclusions] The new algorithm identifies the optimal k value, detects and removes outliers, alleviates the class imbalance problem, and improves classification accuracy. It processes large imbalanced datasets quickly and cost-effectively.
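The sampling pipeline described above (cluster the majority class, rank each cluster's members by distance to its center, then draw from every cluster in proportion to its density) can be sketched as follows. This is an illustrative reconstruction, not the authors' exact KCDWU: the adaptive choice of k, the outlier removal, and the precise distance weighting are simplified away, and `kmeans`/`cluster_undersample` are hypothetical helper names.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means, enough for the sketch (no empty-cluster repair)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # distance of every point to every center -> nearest-center labels
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                        else centers[c] for c in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

def cluster_undersample(X_maj, n_keep, k=3, seed=0):
    """Under-sample the majority class: each cluster contributes a quota
    proportional to its size (density), filled with the points closest to
    the cluster center, so dense regions stay represented."""
    labels, centers = kmeans(X_maj, k, seed=seed)
    kept = []
    for c in range(k):
        idx = np.flatnonzero(labels == c)
        if len(idx) == 0:
            continue
        quota = max(1, round(n_keep * len(idx) / len(X_maj)))
        dist = np.linalg.norm(X_maj[idx] - centers[c], axis=1)
        kept.extend(idx[np.argsort(dist)[:quota]].tolist())
    return sorted(set(kept))
```

With the minority class appended afterwards, the reduced majority set yields a roughly balanced training set without discarding whole regions of the feature space.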

Key words: Class Imbalance; Clustering; Weighted Distance; Undersampling
Received: 13 August 2021      Published: 21 June 2022
ZTFLH:  TP393  
Fund: Natural Science Foundation of Shandong Province (ZR2018QF002); Library and Information Research Project of Shandong Agricultural University (TQ201902)
Corresponding Author: Yao Zhen, ORCID: 0000-0003-2848-939X, E-mail: yz@sdau.edu.cn

Cite this article:

Zhou Qian, Yao Zhen, Sun Bo. Under-sampling Algorithm with Weighted Distance Based on Adaptive K-Means Clustering. Data Analysis and Knowledge Discovery, 2022, 6(5): 127-136.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0847     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I5/127

Functional Simulation of KCDWU
Flowchart of KCDWU
No. Name Features Samples/Samples after dedup Imbalance ratio/ratio after dedup
1 abalone19 8 4 174/4 174 129.44/129.44
2 ecoli0vs1 7 220/220 1.86/1.86
3 ecoli1 7 336/336 3.36/3.36
4 ecoli2 7 336/336 5.46/5.46
5 ecoli3 7 336/336 8.6/8.6
6 glass0 9 214/213 2.06/2.09
7 glass0123vs456 9 214/213 3.2/3.18
8 glass1 9 214/213 1.82/1.8
9 glass2 9 214/213 11.59/11.53
10 haberman 3 306/289 2.78/2.66
11 iris0 4 150/147 2/2.06
12 new-thyroid1 5 215/215 5.14/5.14
13 page-blocks0 10 5 472/5 403 8.79/9.51
14 shuttle0vs4 9 1 829/1 829 13.87/13.87
15 vehicle0 18 846/846 3.25/3.25
16 vehicle1 18 846/846 2.89/2.89
17 vehicle2 18 846/846 2.88/2.88
18 vehicle3 18 846/846 2.99/2.99
19 vowel0 13 988/988 9.98/9.98
20 wisconsin 9 683/449 1.86/1.1
21 yeast05679vs4 8 528/527 9.35/9.33
22 yeast1vs7 7 459/455 14.3/14.17
23 yeast2vs8 8 482/457 23.1/21.85
24 yeast3 8 1 484/1 453 8.1/7.97
25 yeast4 8 1 484/1 453 28.1/27.49
26 Letter-A 16 20 000/18 668 24.35/23.69
27 Adult 15 32 561/26 851 3.15/3.39
Data Sets
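The imbalance-ratio columns in the table above report the majority-to-minority size ratio, before and after duplicate removal. A small helper (illustrative; `imbalance_ratio` is a hypothetical name) reproduces such figures, e.g. a ratio of about 2.06 for a split of 144 majority and 70 minority samples, consistent with the glass0 row:

```python
from collections import Counter

def imbalance_ratio(labels):
    """Imbalance ratio = size of the largest class / size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# 214 samples split 144/70, as in the glass0 dataset: IR ~ 2.06
labels = [0] * 144 + [1] * 70
```

Deduplication removes repeated rows, which is why the ratio can shift slightly (glass0 moves from 2.06 to 2.09 after one duplicate is dropped).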
Actual	Predicted
	Positive	Negative
Positive	TP (true positive)	FN (false negative)
Negative	FP (false positive)	TN (true negative)
Confusion Matrix
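The confusion matrix above underlies the AUC metric used in all the experiments: AUC is the area under the ROC curve traced by the true-positive rate TP/(TP+FN) against the false-positive rate FP/(FP+TN), and it equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. A minimal NumPy implementation of that rank formulation (illustrative; `auc_score` is a hypothetical helper name):

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC via the Mann-Whitney U formulation: the probability that a random
    positive example receives a higher score than a random negative one."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    # compare every positive score with every negative score; ties count half
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))
```

Unlike plain accuracy, this measure is insensitive to the class ratio, which is why it is the standard choice for imbalanced-data experiments.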
No. Dataset k Running time/s Average AUC combined with each classification algorithm:
Linear SVM RBF SVM Decision Tree Random Forest Logistic Regression AdaBoost
1 abalone19 8 1.747 0.889 0.639 0.597 0.833 0.639 0.708
2 ecoli0vs1 13 0.839 0.964 0.964 0.964 0.964 0.964 0.935
3 ecoli1 2 0.399 0.733 0.769 0.876 0.798 0.727 0.828
4 ecoli2 2 0.365 0.821 0.857 0.893 0.750 0.821 0.786
5 ecoli3 2 0.418 1.000 1.000 1.000 1.000 1.000 0.875
6 glass0 11 0.845 0.822 0.850 0.761 0.844 0.822 0.739
7 glass0123vs456 7 0.618 1.000 1.000 0.967 0.967 1.000 1.000
8 glass1 7 0.586 0.580 0.847 0.798 0.704 0.645 0.828
9 glass2 6 0.530 0.700 0.600 0.800 0.700 0.600 0.800
10 haberman 13 0.830 0.594 0.594 0.563 0.656 0.531 0.656
11 iris0 5 0.269 1.000 1.000 1.000 1.000 1.000 1.000
12 new-thyroid1 11 0.490 0.950 0.950 1.000 1.000 0.950 1.000
13 page-blocks0 17 6.052 0.885 0.964 0.825 0.937 0.882 0.934
14 shuttle0vs4 17 2.310 1.000 1.000 1.000 0.975 1.000 1.000
15 vehicle0 10 1.173 1.000 1.000 0.877 0.989 1.000 1.000
16 vehicle1 30 3.401 0.724 0.753 0.671 0.643 0.716 0.735
17 vehicle2 38 4.388 0.917 0.950 0.879 0.934 0.940 0.947
18 vehicle3 27 2.810 0.814 0.821 0.767 0.682 0.824 0.729
19 vowel0 17 1.721 0.944 1.000 1.000 1.000 0.917 0.944
20 wisconsin 28 2.075 0.966 0.963 0.926 0.963 0.963 0.942
21 yeast05679vs4 2 0.438 0.717 0.717 0.767 0.833 0.767 0.850
22 yeast1vs7 2 0.429 0.833 0.833 0.667 0.833 0.833 0.778
23 yeast2vs8 4 0.520 1.000 1.000 1.000 1.000 1.000 0.833
24 yeast3 2 0.638 0.939 0.954 0.938 0.908 0.954 0.908
25 yeast4 2 0.523 0.750 0.917 0.850 0.850 0.750 0.933
Average - 1.377 0.862 0.878 0.855 0.871 0.850 0.868
Experimental Results of Small Sample Data Sets
No. Dataset UB1 UB4 UB24 RUS1 CBIS CBU KCDWU (Average) KCDWU (Max)
1 abalone19 0.695 0.721 0.680 0.631 0.608 0.728 0.718 0.889
2 ecoli0vs1 0.969 0.980 0.980 0.969 0.958 0.982 0.959 0.964
3 ecoli1 0.898 0.900 0.902 0.883 0.921 0.927 0.789 0.876
4 ecoli2 0.870 0.884 0.881 0.899 0.938 0.947 0.821 0.893
5 ecoli3 0.882 0.908 0.894 0.856 0.898 0.926 0.979 1.000
6 glass0 0.818 0.814 0.824 0.813 0.811 0.873 0.806 0.850
7 glass0123vs456 0.894 0.904 0.917 0.930 0.893 0.970 0.989 1.000
8 glass1 0.748 0.737 0.752 0.763 0.788 0.824 0.734 0.847
9 glass2 0.758 0.769 0.706 0.780 0.743 0.760 0.700 0.800
10 haberman 0.658 0.664 0.668 0.655 0.619 0.603 0.599 0.656
11 iris0 0.990 0.990 0.980 0.990 0.970 0.990 1.000 1.000
12 new-thyroid1 0.955 0.964 0.969 0.958 0.973 0.973 0.975 1.000
13 page-blocks0 0.952 0.958 0.959 0.948 0.987 0.986 0.905 0.964
14 shuttle0vs4 1.000 1.000 1.000 1.000 0.999 1.000 0.996 1.000
15 vehicle0 0.945 0.952 0.954 0.958 0.987 0.990 0.978 1.000
16 vehicle1 0.765 0.787 0.761 0.747 0.816 0.832 0.707 0.753
17 vehicle2 0.957 0.964 0.964 0.970 0.995 0.995 0.928 0.950
18 vehicle3 0.764 0.802 0.784 0.765 0.830 0.827 0.773 0.824
19 vowel0 0.944 0.947 0.947 0.943 0.995 0.987 0.968 1.000
20 wisconsin 0.957 0.960 0.971 0.964 0.984 0.990 0.954 0.966
21 yeast05679vs4 0.782 0.794 0.814 0.803 0.849 0.869 0.775 0.850
22 yeast1vs7 0.747 0.786 0.773 0.715 0.765 0.768 0.796 0.833
23 yeast2vs8 0.761 0.783 0.747 0.789 0.879 0.868 0.972 1.000
24 yeast3 0.940 0.934 0.944 0.925 0.958 0.967 0.933 0.954
25 yeast4 0.860 0.855 0.854 0.812 0.878 0.874 0.842 0.933
Average 0.860 0.870 0.865 0.859 0.882 0.898 0.864 0.912
Comparison of AUC in Different Methods
Optimization on Letter-A
Optimization on Adult
[1] Zhang J, Chen L, Tian J X, et al. Breast Cancer Diagnosis Using Cluster-Based Undersampling and Boosted C5.0 Algorithm[J]. International Journal of Control, Automation and Systems, 2021, 19(5):1998-2008.
doi: 10.1007/s12555-019-1061-x
[2] Deng Chengyue. Study on User Credit System in the Socialized Service of University Libraries[J]. Library and Information Service, 2018, 62(23): 59-64.
[3] Lin W C, Tsai C F, Hu Y H, et al. Clustering-Based Undersampling in Class-imbalanced Data[J]. Information Sciences, 2017, 409-410:17-26.
doi: 10.1016/j.ins.2017.05.008
[4] Sahin Y, Bulkan S, Duman E. A Cost-Sensitive Decision Tree Approach for Fraud Detection[J]. Expert Systems with Applications, 2013, 40(15): 5916-5923.
doi: 10.1016/j.eswa.2013.05.021
[5] Zheng W J, Zhao H. Cost-Sensitive Hierarchical Classification for Imbalance Classes[J]. Applied Intelligence, 2020, 50(8): 2328-2338.
doi: 10.1007/s10489-019-01624-z
[6] Yu H L, Mu C X, Sun C Y, et al. Support Vector Machine-Based Optimized Decision Threshold Adjustment Strategy for Classifying Imbalanced Data[J]. Knowledge-Based Systems, 2015, 76: 67-78.
doi: 10.1016/j.knosys.2014.12.007
[7] Ohsaki M, Wang P, Matsuda K, et al. Confusion-Matrix-Based Kernel Logistic Regression for Imbalanced Data Classification[J]. IEEE Transactions on Knowledge and Data Engineering, 2017, 29(9): 1806-1819.
doi: 10.1109/TKDE.2017.2682249
[8] Kim M J, Kang D K, Kim H B. Geometric Mean Based Boosting Algorithm with Over-sampling to Resolve Data Imbalance Problem for Bankruptcy Prediction[J]. Expert Systems with Applications, 2015, 42(3): 1074-1082.
doi: 10.1016/j.eswa.2014.08.025
[9] Chawla N V, Lazarevic A, Hall L O, et al. SMOTEBoost: Improving Prediction of the Minority Class in Boosting[C]// Proceedings of European Conference on Principles of Data Mining and Knowledge Discovery. 2003:107-119.
[10] Seiffert C, Khoshgoftaar T M, van Hulse J, et al. RUSBoost: A Hybrid Approach to Alleviating Class Imbalance[J]. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 2010, 40(1): 185-197.
doi: 10.1109/TSMCA.2009.2029559
[11] Sun J, Lang J, Fujita H, et al. Imbalanced Enterprise Credit Evaluation with DTE-SBD: Decision Tree Ensemble Based on SMOTE and Bagging with Differentiated Sampling Rates[J]. Information Sciences, 2018, 425: 76-91.
doi: 10.1016/j.ins.2017.10.017
[12] Zyblewski P, Sabourin R, Woźniak M. Preprocessed Dynamic Classifier Ensemble Selection for Highly Imbalanced Drifted Data Streams[J]. Information Fusion, 2021, 66: 138-154.
doi: 10.1016/j.inffus.2020.09.004
[13] Wang S, Yao X. Diversity Analysis on Imbalanced Data Sets by Using Ensemble Models[C]// Proceedings of 2009 IEEE Symposium on Computational Intelligence and Data Mining. 2009: 324-331.
[14] Galar M, Fernandez A, Barrenechea E, et al. A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches[J]. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2012, 42(4): 463-484.
[15] Chawla N V, Bowyer K W, Hall L O, et al. SMOTE: Synthetic Minority Over-sampling Technique[J]. Journal of Artificial Intelligence Research, 2002, 16: 321-357.
[16] He H B, Bai Y, Garcia E A, et al. ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning[C]// Proceedings of 2008 IEEE International Joint Conference on Neural Networks. 2008: 1322-1328.
[17] Han H, Wang W Y, Mao B H. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning[C]// Proceedings of International Conference on Intelligent Computing. 2005: 878-887.
[18] Lu Miaofang, Yang Youlong. Oversampling Algorithm Based on Density Peak Clustering and Radial Basis Function[J/OL]. Computer Engineering and Applications. https://kns.cnki.net/kcms/detail/11.2127.TP.20210521.1100.006.htm.
[19] Jindaluang W, Chouvatut V, Kantabutra S. Under-Sampling by Algorithm with Performance Guaranteed for Class-imbalance Problem[C]// Proceedings of 2014 International Computer Science and Engineering Conference. 2014: 215-221.
[20] Tsai C F, Lin W C, Hu Y H, et al. Under-Sampling Class Imbalanced Datasets by Combining Clustering Analysis and Instance Selection[J]. Information Sciences, 2019, 477: 47-54.
doi: 10.1016/j.ins.2018.10.029
[21] Xiao Lianjie, Gao Mengrui, Su Xinning. An Under-Sampling Ensemble Classification Algorithm Based on Fuzzy C-Means Clustering for Imbalanced Data[J]. Data Analysis and Knowledge Discovery, 2019, 3(4): 90-96.
[22] Cui Caixia, Cao Fuyuan, Liang Jiye. Adaptive Undersampling Based on Density Peak Clustering[J]. Pattern Recognition and Artificial Intelligence, 2020, 33(9): 811-819.
[23] Yen S J, Lee Y S. Cluster-Based Under-Sampling Approaches for Imbalanced Data Distributions[J]. Expert Systems with Applications, 2009, 36(3): 5718-5727.
doi: 10.1016/j.eswa.2008.06.108
[24] Sobhani P, Viktor H, Matwin S. Learning from Imbalanced Data Using Ensemble Methods and Cluster-Based Undersampling[C]// Proceedings of the 3rd International Conference on New Frontiers in Mining Complex Patterns. 2014: 69-83.
[25] Macqueen J B. Some Methods for Classification and Analysis of Multivariate Observations[C]// Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. 1967: 281-297.
[26] Zseby T. Stratification Strategies for Sampling-based Non-intrusive Measurements of One-way Delay[C]// Proceedings of Passive and Active Measurement Workshop (PAM 2003). 2003.
[27] Alcalá-Fdez J, Fernández A, Luengo J, et al. KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework[J]. Journal of Multiple-Valued Logic and Soft Computing, 2011, 17(2-3): 255-287.
[28] Dua D, Graff C. UCI Machine Learning Repository[DB/OL]. [2021-07-08]. http://archive.ics.uci.edu/ml.
[29] Spackman K A. Signal Detection Theory: Valuable Tools for Evaluating Inductive Learning[C]// Proceedings of the 6th International Workshop on Machine Learning. 1989: 160-163.