Data Analysis and Knowledge Discovery, 2022, Vol. 6, Issue 5: 127-136     https://doi.org/10.11925/infotech.2096-3467.2021.0847
Research Article
Under-sampling Algorithm with Weighted Distance Based on Adaptive K-Means Clustering
Zhou Qian1,Yao Zhen2(),Sun Bo1
1College of Information Science and Engineering, Shandong Agricultural University, Taian 271018, China
2Library of Shandong Agricultural University, Taian 271018, China

Abstract

[Objective] This study aims to reduce the impact of class-imbalanced data on classification accuracy. [Methods] First, we used an adaptive k-means clustering algorithm to cluster the majority class and to find and remove outliers. Then, we calculated the weighted distance between each sample and its cluster center and sorted the samples by this distance, sequentially sampling the majority class according to cluster density. Finally, we merged the sampled data with the minority class and trained the classification algorithm on the combined set. [Results] The maximum average AUC reached 0.912 on 25 imbalanced datasets, at least 0.014 higher than competing methods, with an average running time of only 1.377 s. The algorithm also performed well on two large imbalanced datasets. [Limitations] The proposed model only handles binary classification and cannot address multi-class problems. [Conclusions] The new algorithm can identify the optimal k value, detect and remove outliers, alleviate the class imbalance problem, and improve classification accuracy. It is fast and has low overhead, making it suitable for large imbalanced datasets.
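The pipeline described in [Methods] can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: k is fixed here rather than chosen adaptively, the outlier rule is a simple distance quantile, and the "weighted distance" is plain Euclidean distance to the cluster center, since the abstract does not give the exact formulas. All function and parameter names are ours.

```python
import numpy as np

def kcdwu_undersample(X_maj, X_min, k=3, outlier_quantile=0.95, seed=0):
    """Cluster the majority class, drop outliers, then sample each
    cluster in distance order, proportional to cluster density."""
    rng = np.random.default_rng(seed)

    # --- plain k-means on the majority class ---
    centers = X_maj[rng.choice(len(X_maj), k, replace=False)]
    for _ in range(50):
        d = np.linalg.norm(X_maj[:, None] - centers[None], axis=2)  # (n, k)
        labels = d.argmin(1)
        new = np.array([X_maj[labels == j].mean(0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    dist = d[np.arange(len(X_maj)), labels]

    # --- remove outliers: points unusually far from their center ---
    keep = dist <= np.quantile(dist, outlier_quantile)
    X_maj, labels, dist = X_maj[keep], labels[keep], dist[keep]

    # --- per cluster, sort by distance and keep a density-proportional
    #     share so the sampled majority roughly matches the minority size ---
    sizes = np.bincount(labels, minlength=k)
    quota = np.floor(sizes / sizes.sum() * len(X_min)).astype(int)
    picked = []
    for j in range(k):
        idx = np.where(labels == j)[0]
        idx = idx[np.argsort(dist[idx])]          # nearest-to-center first
        picked.append(idx[:quota[j]])
    picked = np.concatenate(picked)

    # --- merge sampled majority (label 0) with the minority class (label 1) ---
    X_bal = np.vstack([X_maj[picked], X_min])
    y_bal = np.r_[np.zeros(len(picked)), np.ones(len(X_min))]
    return X_bal, y_bal
```

The balanced set returned here would then be fed to any of the six classifiers evaluated in Table 3.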

Key words: Class Imbalance; Clustering; Weighted Distance; Undersampling
Received: 2021-08-13      Published: 2022-06-21
CLC number: TP393
Funding: Youth Fund of the Shandong Provincial Natural Science Foundation (ZR2018QF002); Library and Information Research Project of Shandong Agricultural University (TQ201902)
Corresponding author: Yao Zhen, ORCID: 0000-0003-2848-939X, E-mail: yz@sdau.edu.cn
Cite this article:
Zhou Qian, Yao Zhen, Sun Bo. Under-sampling Algorithm with Weighted Distance Based on Adaptive K-Means Clustering. Data Analysis and Knowledge Discovery, 2022, 6(5): 127-136.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0847      或      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I5/127
Fig.1  Functional illustration of the KCDWU algorithm
Fig.2  Flowchart of the KCDWU algorithm
No.  Name  Features  Samples (raw/after deduplication)  Imbalance ratio (raw/after deduplication)
1 abalone19 8 4 174/4 174 129.44/129.44
2 ecoli0vs1 7 220/220 1.86/1.86
3 ecoli1 7 336/336 3.36/3.36
4 ecoli2 7 336/336 5.46/5.46
5 ecoli3 7 336/336 8.6/8.6
6 glass0 9 214/213 2.06/2.09
7 glass0123vs456 9 214/213 3.2/3.18
8 glass1 9 214/213 1.82/1.8
9 glass2 9 214/213 11.59/11.53
10 haberman 3 306/289 2.78/2.66
11 iris0 4 150/147 2/2.06
12 new-thyroid1 5 215/215 5.14/5.14
13 page-blocks0 10 5 472/5 403 8.79/9.51
14 shuttle0vs4 9 1 829/1 829 13.87/13.87
15 vehicle0 18 846/846 3.25/3.25
16 vehicle1 18 846/846 2.89/2.89
17 vehicle2 18 846/846 2.88/2.88
18 vehicle3 18 846/846 2.99/2.99
19 vowel0 13 988/988 9.98/9.98
20 wisconsin 9 683/449 1.86/1.1
21 yeast05679vs4 8 528/527 9.35/9.33
22 yeast1vs7 7 459/455 14.3/14.17
23 yeast2vs8 8 482/457 23.1/21.85
24 yeast3 8 1 484/1 453 8.1/7.97
25 yeast4 8 1 484/1 453 28.1/27.49
26 Letter-A 16 20 000/18 668 24.35/23.69
27 Adult 15 32 561/26 851 3.15/3.39
Table 1  Dataset details
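The imbalance ratio column in Table 1 follows the usual convention of majority-class size over minority-class size; a quick sketch (label values are illustrative):

```python
import numpy as np

def imbalance_ratio(y):
    """Majority-class count divided by minority-class count."""
    counts = np.bincount(y)
    return counts.max() / counts.min()

# e.g. ecoli0vs1 in Table 1: 220 samples split 143/77 gives a ratio of ~1.86
y = np.array([0] * 143 + [1] * 77)
print(round(imbalance_ratio(y), 2))  # 1.86
```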
Actual \ Predicted	Positive	Negative
Positive	TP (true positive)	FN (false negative)
Negative	FP (false positive)	TN (true negative)
Table 2  Confusion matrix
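The four cells of Table 2, and the AUC metric reported in Tables 3 and 4, can be computed as below. The AUC here uses the rank-based (Wilcoxon/Mann-Whitney) formulation — the probability that a random positive is scored above a random negative; function names are ours:

```python
import numpy as np

def confusion(y_true, y_pred):
    """Return the (TP, FN, FP, TN) cells of Table 2 (positive = 1)."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    return tp, fn, fp, tn

def auc(y_true, scores):
    """Probability that a random positive outranks a random negative,
    counting ties as half a win."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = ((pos[:, None] > neg[None, :]).sum()
            + 0.5 * (pos[:, None] == neg[None, :]).sum())
    return wins / (len(pos) * len(neg))
```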
No.  Dataset  k  Running time/s  Average AUC when combined with each classifier (Linear SVM, RBF SVM, Decision Tree, Random Forest, Logistic Regression, AdaBoost)
1 abalone19 8 1.747 0.889 0.639 0.597 0.833 0.639 0.708
2 ecoli0vs1 13 0.839 0.964 0.964 0.964 0.964 0.964 0.935
3 ecoli1 2 0.399 0.733 0.769 0.876 0.798 0.727 0.828
4 ecoli2 2 0.365 0.821 0.857 0.893 0.750 0.821 0.786
5 ecoli3 2 0.418 1.000 1.000 1.000 1.000 1.000 0.875
6 glass0 11 0.845 0.822 0.850 0.761 0.844 0.822 0.739
7 glass0123vs456 7 0.618 1.000 1.000 0.967 0.967 1.000 1.000
8 glass1 7 0.586 0.580 0.847 0.798 0.704 0.645 0.828
9 glass2 6 0.530 0.700 0.600 0.800 0.700 0.600 0.800
10 haberman 13 0.830 0.594 0.594 0.563 0.656 0.531 0.656
11 iris0 5 0.269 1.000 1.000 1.000 1.000 1.000 1.000
12 new-thyroid1 11 0.490 0.950 0.950 1.000 1.000 0.950 1.000
13 page-blocks0 17 6.052 0.885 0.964 0.825 0.937 0.882 0.934
14 shuttle0vs4 17 2.310 1.000 1.000 1.000 0.975 1.000 1.000
15 vehicle0 10 1.173 1.000 1.000 0.877 0.989 1.000 1.000
16 vehicle1 30 3.401 0.724 0.753 0.671 0.643 0.716 0.735
17 vehicle2 38 4.388 0.917 0.950 0.879 0.934 0.940 0.947
18 vehicle3 27 2.810 0.814 0.821 0.767 0.682 0.824 0.729
19 vowel0 17 1.721 0.944 1.000 1.000 1.000 0.917 0.944
20 wisconsin 28 2.075 0.966 0.963 0.926 0.963 0.963 0.942
21 yeast05679vs4 2 0.438 0.717 0.717 0.767 0.833 0.767 0.850
22 yeast1vs7 2 0.429 0.833 0.833 0.667 0.833 0.833 0.778
23 yeast2vs8 4 0.520 1.000 1.000 1.000 1.000 1.000 0.833
24 yeast3 2 0.638 0.939 0.954 0.938 0.908 0.954 0.908
25 yeast4 2 0.523 0.750 0.917 0.850 0.850 0.750 0.933
Average - 1.377 0.862 0.878 0.855 0.871 0.850 0.868
Table 3  Experimental results on the small datasets
No.  Dataset  UB1  UB4  UB24  RUS1  CBIS  CBU  KCDWU (Average)  KCDWU (Max)
1 abalone19 0.695 0.721 0.680 0.631 0.608 0.728 0.718 0.889
2 ecoli0vs1 0.969 0.980 0.980 0.969 0.958 0.982 0.959 0.964
3 ecoli1 0.898 0.900 0.902 0.883 0.921 0.927 0.789 0.876
4 ecoli2 0.870 0.884 0.881 0.899 0.938 0.947 0.821 0.893
5 ecoli3 0.882 0.908 0.894 0.856 0.898 0.926 0.979 1.000
6 glass0 0.818 0.814 0.824 0.813 0.811 0.873 0.806 0.850
7 glass0123vs456 0.894 0.904 0.917 0.930 0.893 0.970 0.989 1.000
8 glass1 0.748 0.737 0.752 0.763 0.788 0.824 0.734 0.847
9 glass2 0.758 0.769 0.706 0.780 0.743 0.760 0.700 0.800
10 haberman 0.658 0.664 0.668 0.655 0.619 0.603 0.599 0.656
11 iris0 0.990 0.990 0.980 0.990 0.970 0.990 1.000 1.000
12 new-thyroid1 0.955 0.964 0.969 0.958 0.973 0.973 0.975 1.000
13 page-blocks0 0.952 0.958 0.959 0.948 0.987 0.986 0.905 0.964
14 shuttle0vs4 1.000 1.000 1.000 1.000 0.999 1.000 0.996 1.000
15 vehicle0 0.945 0.952 0.954 0.958 0.987 0.990 0.978 1.000
16 vehicle1 0.765 0.787 0.761 0.747 0.816 0.832 0.707 0.753
17 vehicle2 0.957 0.964 0.964 0.970 0.995 0.995 0.928 0.950
18 vehicle3 0.764 0.802 0.784 0.765 0.830 0.827 0.773 0.824
19 vowel0 0.944 0.947 0.947 0.943 0.995 0.987 0.968 1.000
20 wisconsin 0.957 0.960 0.971 0.964 0.984 0.990 0.954 0.966
21 yeast05679vs4 0.782 0.794 0.814 0.803 0.849 0.869 0.775 0.850
22 yeast1vs7 0.747 0.786 0.773 0.715 0.765 0.768 0.796 0.833
23 yeast2vs8 0.761 0.783 0.747 0.789 0.879 0.868 0.972 1.000
24 yeast3 0.940 0.934 0.944 0.925 0.958 0.967 0.933 0.954
25 yeast4 0.860 0.855 0.854 0.812 0.878 0.874 0.842 0.933
Average 0.860 0.870 0.865 0.859 0.882 0.898 0.864 0.912
Table 4  AUC values of different methods
Fig.3  Optimization on the Letter-A dataset
Fig.4  Optimization on the Adult dataset