Data Analysis and Knowledge Discovery, 2022, Vol. 6, Issue 5: 127-136     https://doi.org/10.11925/infotech.2096-3467.2021.0847
Research Article
Under-sampling Algorithm with Weighted Distance Based on Adaptive K-Means Clustering
Zhou Qian1,Yao Zhen2(),Sun Bo1
1College of Information Science and Engineering, Shandong Agricultural University, Taian 271018, China
2Library of Shandong Agricultural University, Taian 271018, China

Abstract

[Objective] This study aims to reduce the impact of class-imbalanced data on classification accuracy. [Methods] First, we used an adaptive k-means clustering algorithm to cluster the majority class and to find and remove outliers. Then, we calculated the weighted distance between each sample and its cluster center and sorted the samples by this distance, sequentially sampling the majority class according to cluster density. Finally, we merged the sampled data with the minority class and trained the classification algorithm on the combined set. [Results] The maximum average AUC reached 0.912 on 25 imbalanced datasets, at least 0.014 higher than competing methods, with an average running time of only 1.377 s. The algorithm also performed well on two large imbalanced datasets. [Limitations] The proposed model only handles binary classification and cannot address multi-class problems. [Conclusions] The new algorithm can identify the optimal k value, detect and remove outliers, alleviate the class imbalance problem, and improve classification accuracy. It is fast and has low overhead, making it suitable for large imbalanced datasets.
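The pipeline described in [Methods] can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: k is fixed here rather than chosen adaptively, the outlier rule is a simple distance quantile, and the "weighted distance" is plain Euclidean distance to the cluster center, since the abstract does not give the exact formulas. All function and parameter names are ours.

```python
import numpy as np

def kcdwu_undersample(X_maj, X_min, k=3, outlier_quantile=0.95, seed=0):
    """Cluster the majority class, drop outliers, then sample each
    cluster in distance order, proportional to cluster density."""
    rng = np.random.default_rng(seed)

    # --- plain k-means on the majority class ---
    centers = X_maj[rng.choice(len(X_maj), k, replace=False)]
    for _ in range(50):
        d = np.linalg.norm(X_maj[:, None] - centers[None], axis=2)  # (n, k)
        labels = d.argmin(1)
        new = np.array([X_maj[labels == j].mean(0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    dist = d[np.arange(len(X_maj)), labels]

    # --- remove outliers: points unusually far from their center ---
    keep = dist <= np.quantile(dist, outlier_quantile)
    X_maj, labels, dist = X_maj[keep], labels[keep], dist[keep]

    # --- per cluster, sort by distance and keep a density-proportional
    #     share so the sampled majority roughly matches the minority size ---
    sizes = np.bincount(labels, minlength=k)
    quota = np.floor(sizes / sizes.sum() * len(X_min)).astype(int)
    picked = []
    for j in range(k):
        idx = np.where(labels == j)[0]
        idx = idx[np.argsort(dist[idx])]          # nearest-to-center first
        picked.append(idx[:quota[j]])
    picked = np.concatenate(picked)

    # --- merge sampled majority (label 0) with the minority class (label 1) ---
    X_bal = np.vstack([X_maj[picked], X_min])
    y_bal = np.r_[np.zeros(len(picked)), np.ones(len(X_min))]
    return X_bal, y_bal
```

The balanced set returned here would then be fed to any of the six classifiers evaluated in Table 3.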

Key words: Class Imbalance; Clustering; Weighted Distance; Undersampling
Received: 2021-08-13      Published: 2022-06-21
CLC number: TP393
Funding: Youth Fund of the Shandong Provincial Natural Science Foundation (ZR2018QF002); Library and Information Research Project of Shandong Agricultural University (TQ201902)
Corresponding author: Yao Zhen, ORCID: 0000-0003-2848-939X, E-mail: yz@sdau.edu.cn
Cite this article:
Zhou Qian, Yao Zhen, Sun Bo. Under-sampling Algorithm with Weighted Distance Based on Adaptive K-Means Clustering. Data Analysis and Knowledge Discovery, 2022, 6(5): 127-136.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0847      或      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I5/127
Fig.1  Functional illustration of the KCDWU algorithm
Fig.2  Flowchart of the KCDWU algorithm
No.  Name  Features  Samples (raw/after deduplication)  Imbalance ratio (raw/after deduplication)
1 abalone19 8 4 174/4 174 129.44/129.44
2 ecoli0vs1 7 220/220 1.86/1.86
3 ecoli1 7 336/336 3.36/3.36
4 ecoli2 7 336/336 5.46/5.46
5 ecoli3 7 336/336 8.6/8.6
6 glass0 9 214/213 2.06/2.09
7 glass0123vs456 9 214/213 3.2/3.18
8 glass1 9 214/213 1.82/1.8
9 glass2 9 214/213 11.59/11.53
10 haberman 3 306/289 2.78/2.66
11 iris0 4 150/147 2/2.06
12 new-thyroid1 5 215/215 5.14/5.14
13 page-blocks0 10 5 472/5 403 8.79/9.51
14 shuttle0vs4 9 1 829/1 829 13.87/13.87
15 vehicle0 18 846/846 3.25/3.25
16 vehicle1 18 846/846 2.89/2.89
17 vehicle2 18 846/846 2.88/2.88
18 vehicle3 18 846/846 2.99/2.99
19 vowel0 13 988/988 9.98/9.98
20 wisconsin 9 683/449 1.86/1.1
21 yeast05679vs4 8 528/527 9.35/9.33
22 yeast1vs7 7 459/455 14.3/14.17
23 yeast2vs8 8 482/457 23.1/21.85
24 yeast3 8 1 484/1 453 8.1/7.97
25 yeast4 8 1 484/1 453 28.1/27.49
26 Letter-A 16 20 000/18 668 24.35/23.69
27 Adult 15 32 561/26 851 3.15/3.39
Table 1  Dataset details
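The imbalance ratio column in Table 1 follows the usual convention of majority-class size over minority-class size; a quick sketch (label values are illustrative):

```python
import numpy as np

def imbalance_ratio(y):
    """Majority-class count divided by minority-class count."""
    counts = np.bincount(y)
    return counts.max() / counts.min()

# e.g. ecoli0vs1 in Table 1: 220 samples split 143/77 gives a ratio of ~1.86
y = np.array([0] * 143 + [1] * 77)
print(round(imbalance_ratio(y), 2))  # 1.86
```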
Actual \ Predicted	Positive	Negative
Positive	TP (true positive)	FN (false negative)
Negative	FP (false positive)	TN (true negative)
Table 2  Confusion matrix
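The four cells of Table 2, and the AUC metric reported in Tables 3 and 4, can be computed as below. The AUC here uses the rank-based (Wilcoxon/Mann-Whitney) formulation — the probability that a random positive is scored above a random negative; function names are ours:

```python
import numpy as np

def confusion(y_true, y_pred):
    """Return the (TP, FN, FP, TN) cells of Table 2 (positive = 1)."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    return tp, fn, fp, tn

def auc(y_true, scores):
    """Probability that a random positive outranks a random negative,
    counting ties as half a win."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = ((pos[:, None] > neg[None, :]).sum()
            + 0.5 * (pos[:, None] == neg[None, :]).sum())
    return wins / (len(pos) * len(neg))
```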
No.  Dataset  k  Running time/s  Average AUC when combined with each classifier (Linear SVM, RBF SVM, Decision Tree, Random Forest, Logistic Regression, AdaBoost)
1 abalone19 8 1.747 0.889 0.639 0.597 0.833 0.639 0.708
2 ecoli0vs1 13 0.839 0.964 0.964 0.964 0.964 0.964 0.935
3 ecoli1 2 0.399 0.733 0.769 0.876 0.798 0.727 0.828
4 ecoli2 2 0.365 0.821 0.857 0.893 0.750 0.821 0.786
5 ecoli3 2 0.418 1.000 1.000 1.000 1.000 1.000 0.875
6 glass0 11 0.845 0.822 0.850 0.761 0.844 0.822 0.739
7 glass0123vs456 7 0.618 1.000 1.000 0.967 0.967 1.000 1.000
8 glass1 7 0.586 0.580 0.847 0.798 0.704 0.645 0.828
9 glass2 6 0.530 0.700 0.600 0.800 0.700 0.600 0.800
10 haberman 13 0.830 0.594 0.594 0.563 0.656 0.531 0.656
11 iris0 5 0.269 1.000 1.000 1.000 1.000 1.000 1.000
12 new-thyroid1 11 0.490 0.950 0.950 1.000 1.000 0.950 1.000
13 page-blocks0 17 6.052 0.885 0.964 0.825 0.937 0.882 0.934
14 shuttle0vs4 17 2.310 1.000 1.000 1.000 0.975 1.000 1.000
15 vehicle0 10 1.173 1.000 1.000 0.877 0.989 1.000 1.000
16 vehicle1 30 3.401 0.724 0.753 0.671 0.643 0.716 0.735
17 vehicle2 38 4.388 0.917 0.950 0.879 0.934 0.940 0.947
18 vehicle3 27 2.810 0.814 0.821 0.767 0.682 0.824 0.729
19 vowel0 17 1.721 0.944 1.000 1.000 1.000 0.917 0.944
20 wisconsin 28 2.075 0.966 0.963 0.926 0.963 0.963 0.942
21 yeast05679vs4 2 0.438 0.717 0.717 0.767 0.833 0.767 0.850
22 yeast1vs7 2 0.429 0.833 0.833 0.667 0.833 0.833 0.778
23 yeast2vs8 4 0.520 1.000 1.000 1.000 1.000 1.000 0.833
24 yeast3 2 0.638 0.939 0.954 0.938 0.908 0.954 0.908
25 yeast4 2 0.523 0.750 0.917 0.850 0.850 0.750 0.933
Average - 1.377 0.862 0.878 0.855 0.871 0.850 0.868
Table 3  Experimental results on the small datasets
No.  Dataset  UB1  UB4  UB24  RUS1  CBIS  CBU  KCDWU (Average)  KCDWU (Max)
1 abalone19 0.695 0.721 0.680 0.631 0.608 0.728 0.718 0.889
2 ecoli0vs1 0.969 0.980 0.980 0.969 0.958 0.982 0.959 0.964
3 ecoli1 0.898 0.900 0.902 0.883 0.921 0.927 0.789 0.876
4 ecoli2 0.870 0.884 0.881 0.899 0.938 0.947 0.821 0.893
5 ecoli3 0.882 0.908 0.894 0.856 0.898 0.926 0.979 1.000
6 glass0 0.818 0.814 0.824 0.813 0.811 0.873 0.806 0.850
7 glass0123vs456 0.894 0.904 0.917 0.930 0.893 0.970 0.989 1.000
8 glass1 0.748 0.737 0.752 0.763 0.788 0.824 0.734 0.847
9 glass2 0.758 0.769 0.706 0.780 0.743 0.760 0.700 0.800
10 haberman 0.658 0.664 0.668 0.655 0.619 0.603 0.599 0.656
11 iris0 0.990 0.990 0.980 0.990 0.970 0.990 1.000 1.000
12 new-thyroid1 0.955 0.964 0.969 0.958 0.973 0.973 0.975 1.000
13 page-blocks0 0.952 0.958 0.959 0.948 0.987 0.986 0.905 0.964
14 shuttle0vs4 1.000 1.000 1.000 1.000 0.999 1.000 0.996 1.000
15 vehicle0 0.945 0.952 0.954 0.958 0.987 0.990 0.978 1.000
16 vehicle1 0.765 0.787 0.761 0.747 0.816 0.832 0.707 0.753
17 vehicle2 0.957 0.964 0.964 0.970 0.995 0.995 0.928 0.950
18 vehicle3 0.764 0.802 0.784 0.765 0.830 0.827 0.773 0.824
19 vowel0 0.944 0.947 0.947 0.943 0.995 0.987 0.968 1.000
20 wisconsin 0.957 0.960 0.971 0.964 0.984 0.990 0.954 0.966
21 yeast05679vs4 0.782 0.794 0.814 0.803 0.849 0.869 0.775 0.850
22 yeast1vs7 0.747 0.786 0.773 0.715 0.765 0.768 0.796 0.833
23 yeast2vs8 0.761 0.783 0.747 0.789 0.879 0.868 0.972 1.000
24 yeast3 0.940 0.934 0.944 0.925 0.958 0.967 0.933 0.954
25 yeast4 0.860 0.855 0.854 0.812 0.878 0.874 0.842 0.933
Average 0.860 0.870 0.865 0.859 0.882 0.898 0.864 0.912
Table 4  AUC values of different methods
Fig.3  Optimization on the Letter-A dataset
Fig.4  Optimization on the Adult dataset