Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (4): 90-96    DOI: 10.11925/infotech.2096-3467.2018.0533
Research Paper
An Under-sampling Ensemble Classification Algorithm Based on Fuzzy C-Means Clustering for Imbalanced Data
Lianjie Xiao, Mengrui Gao, Xinning Su
School of Information Management, Nanjing University, Nanjing 210023, China
Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
Abstract

[Objective] This paper addresses the low classification accuracy on the minority class that class imbalance causes in binary classification tasks. [Methods] We propose an under-sampling ensemble classification algorithm for imbalanced data based on fuzzy C-means (FCM) clustering, called ECFCM. The majority-class samples are under-sampled by FCM clustering, and the cluster-center samples are combined with all minority-class samples to form a balanced dataset. A Bagging-based ensemble learning algorithm then classifies the balanced dataset. [Results] Matlab simulation experiments on four imbalanced datasets show that ECFCM improves Acc, AUC and F1 by up to 5.75% (Spambase), 13.84% (Glass2) and 7.54% (Spambase), respectively. [Limitations] The effectiveness of ECFCM is verified on standard datasets only. Imbalanced data from real applications call for classification algorithms studied specifically for that setting. [Conclusions] ECFCM performs well and helps, to a certain extent, to improve the classification accuracy of the minority class in imbalanced data.

Keywords: Imbalanced Data; Fuzzy C-Means Clustering; Classification; Under-sampling; Ensemble Learning
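The two stages described in the abstract (FCM-based under-sampling of the majority class, then Bagging on the balanced set) can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the authors' Matlab implementation. Several choices are assumptions that the abstract does not specify: the number of clusters is set equal to the number of minority samples, the FCM centers themselves stand in for the "cluster-center samples", the fuzzifier is `m = 2`, and the base learner is a 1-nearest-neighbour rule. The function names and the toy data are hypothetical.

```python
import math
import random


def fcm_centers(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Fuzzy c-means: return c cluster centers for a list of feature vectors."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial membership matrix; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(iters):
        # Center update: mean of all points weighted by membership^m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            centers.append([sum(w[i] * points[i][d] for i in range(n)) / total
                            for d in range(dim)])
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1)).
        shift = 0.0
        for i in range(n):
            dist = [max(math.dist(points[i], ctr), 1e-12) for ctr in centers]
            for j in range(c):
                new = 1.0 / sum((dist[j] / dk) ** (2.0 / (m - 1.0)) for dk in dist)
                shift = max(shift, abs(new - u[i][j]))
                u[i][j] = new
        if shift < tol:
            break
    return centers


def undersample_majority(majority, minority, seed=0):
    """Assumed balancing rule: one FCM center per minority sample."""
    centers = fcm_centers(majority, c=len(minority), seed=seed)
    # Label 0 = majority (represented by centers), label 1 = minority.
    return [(ctr, 0) for ctr in centers] + [(list(x), 1) for x in minority]


def bagging_predict(train, x, n_estimators=11, seed=0):
    """Majority vote of 1-NN learners, each fit on a bootstrap resample."""
    rng = random.Random(seed)
    votes = 0
    for _ in range(n_estimators):
        boot = [rng.choice(train) for _ in train]  # sample with replacement
        nearest = min(boot, key=lambda s: math.dist(s[0], x))
        votes += nearest[1]
    return 1 if 2 * votes > n_estimators else 0


# Hypothetical toy data: 40 majority points near (0, 0), 5 minority near (4, 4).
rng = random.Random(42)
majority = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]
minority = [[rng.gauss(4, 0.5), rng.gauss(4, 0.5)] for _ in range(5)]

balanced = undersample_majority(majority, minority)
print(len(balanced), "samples after under-sampling")
print("prediction near minority cluster:", bagging_predict(balanced, [4.0, 4.0]))
print("prediction near majority cluster:", bagging_predict(balanced, [0.0, 0.0]))
```

Using the centers directly keeps the sketch short. An alternative reading of "cluster-center samples" is the real majority sample closest to each center, which would keep every training point an observed data point.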
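Received: 2018-05-11
Funding: This paper is one of the research results of the Major Program of the National Social Science Fund of China, "Research on the Discipline Construction of Information Science and the Future Development Path of Information Work" (Grant No. 17ZDA291), and of the Nanjing University Graduate Interdisciplinary Research Innovation Project, "Research on Building a Knowledge Base of Information Science Theories and Methods in the Big Data Environment" (Grant No. 2018ZDW03).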
Cite this article:
Lianjie Xiao, Mengrui Gao, Xinning Su. An Under-sampling Ensemble Classification Algorithm Based on Fuzzy C-Means Clustering for Imbalanced Data[J]. Data Analysis and Knowledge Discovery, 2019, 3(4): 90-96. DOI: 10.11925/infotech.2096-3467.2018.0533.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.0533