Please wait a minute...
New Technology of Library and Information Service  2011, Vol. 27 Issue (5): 28-35    DOI: 10.11925/infotech.1003-3513.2011.05.05
Current Issue | Archive | Adv Search |
Survey on K-means Algorithm
Wu Suhui1, Cheng Ying1, Zheng Yanning2, Pan Yuntao2
1. Department of Information Management, Nanjing University, Nanjing 210093,China;
2. Institute of Scientific & Technical Information of China, Beijing 100038,China
Download: PDF(957 KB)   HTML  
Export: BibTeX | EndNote (RIS)      
Abstract  The main problems of K-means algorithm which is a basic algorithm in clustering are outlined in this paper, such as determination of the optimal clusters, selection of initial centers, and categorical data clustering, etc. The development, the hot spots and difficulties of this algorithm are clarified, and some ideas are introduced to improve the algorithm efficiency.
Key wordsK-means algorithm      Clustering algorithm      Number of clusters      Initial clustering centers     
Received: 21 March 2011      Published: 11 July 2011
: 

G202

 

Cite this article:

Wu Suhui, Cheng Ying, Zheng Yanning, Pan Yuntao. Survey on K-means Algorithm. New Technology of Library and Information Service, 2011, 27(5): 28-35.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2011.05.05     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2011/V27/I5/28

[1] Hartigan J A. Clustering Algorithms [M]. New York: John Wiley & Sons Inc., 1975.

[2] MacQueen J. Some Methods for Classification and Analysis of Multivariate Observations . In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. Berkeley, University of California Press,1967:281-297.

[3] Cox D R. Note on Grouping [J].Journal of the American Statistical Association, 1957, 52(280): 543-547.

[4] Fisher W D. On Grouping for Maximum Homogeneity[J].Journal of the American Statistical Association, 1958, 53(284):789-798.

[5] Sebestyen G S. Decision Making Process in Pattern Recognition[M]. New York: Macmillan,1962.

[6] Wilpon J G, Rabiner L R. A Modified K-means Clustering Algorithm for Use in Isolated Word Recognition[J]. IEEE Transactions on Acoustics Speech and Signal Processing, 1985,33(3):587-594.

[7] McBratney A B, De Gruijter J J. A Continuum Approach to Soil Classification by Modified Fuzzy K-means with Extragrades[J]. Journal of Soil Science,1992,43(1):159-175.

[8] Simek J F. A K-means Approach to the Analysis of Spatial Structure in Upper Paleolithic Habitation Sites[M]. Oxford, England: B.A.R.,1984:400-401.

[9] Hartigan J A. Asymptotic Distributions for Clustering Criteria[J].The Annuals of Statistics, 1978,6(1):117-131.

[10] Pollard D. Strong Consistency of K-means Clustering[J].The Annuals of Statistics,1981,9(1):135-140.

[11] Pollard D. A Central Limit Theorem for K-means Clustering[J].The Annuals of Probability, 1982,10(4):919-926.

[12] Creator R, Kaufman L, Source R, et al. K-means Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence,1984,6(1):81-87.

[13] Chinrungrueng C, Sequin C H. Optimal Adaptive K-means Algorithm with Dynamic Adjustment of Learning Rate[J]. IEEE Transactions on Neural Networks,1995,6(1):157-169.

[14] Davies D L, Bouldin D W. A Cluster Separation Measure[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence,1979(2):224-227.

[15] Dunn J C.A Fuzzy Relative of the Isodata Process and Its Use in Detecting Compact Well-Separated Clusters[J]. Cybernetics and Systems,1973,3(3):32-57.

[16] Rezaee M R, Lelieveldt B P, Reiber J H. A New Cluster Validity Index for the Fuzzy C-Means[J]. Pattern Recognition Letters,1998,19(3-4):237-246.

[17] Hubert L J, Arabie P. Comparing Partitions[J]. Journal of Classification,1985,2(1): 193-218.

[18] Bezdek J C,Pal N R. Some New Indexes of Cluster Validity[J]. IEEE Transactions on Systems, Man, and Cybernetics,1998,28(3):301-315.

[19] 李永森,杨善林,马溪骏,等. 空间聚类算法中的 K值优化问题研究[J]. 系统仿真学报 ,2006,18(3):573-576.

[20] 张逸清,刘文才.聚类数的确定[J]. 计算机与数字工程 ,2007,35(2):42-44.

[21] 张忠平,王爱杰,柴旭光.简单有效的确定聚类数目算法[J]. 计算机工程与应用 ,2009,45(15):166-168.

[22] Bandyopadhyay S, Maulik U. Genetic Clustering for Automatic Evolution of Clusters and Application to Image Classification[J]. Pattern Recognition,2002,35(6):1197-1208.

[23] Lin H J, Yang F W, Kao Y T. An Efficient GA-based Clustering Technique [J]. Tamkang Journal of Science and Engineering,2005,8(2):113-122.

[24] Liu Y, Ye M, Peng J, et al. Finding the Optimal Number of Clusters Using Genetic Algorithms .In:Proceedings of IEEE International Conference on Cybernetic Intelligent Systems.2008:1325-1330.

[25] 巩敦卫,蒋余庆,张勇,等.基于微粒群优化聚类数目的K-均值算法[J]. 控制理论与应用 ,2009,26(10):1175-1179.

[26] Van Der Merwe D W, Engelbrecht A P. Data Clustering Using Particle Swarm Optimization . In:Proceedings of the 2003 Congress on Evolutionary Computation, Canberra,Australia.2003:215-220.

[27] Omran M, Engelbrecht A P, Salman A. Particle Swarm Optimization Method for Image Clustering[J]. International Journal of Pattern Recognition and Artificial Intelligence,2005,19(3):297-321.

[28] Xu L, Krzyzak A, Oja E. Rival Penalized Competitive Learning for Clustering Analysis, RBF Net,and Curve Detection[J]. IEEE Transactions on Neural Networks, 1993, 4(4):636-649.

[29] Xu L, Krzyzak A, Oja E. Unsupervised and Supervised Classification by Rival Penalized Competitive Learning . In:Proceedings of the 11th International Conference on Pattern Recognition.1992:496-499.

[30] Pelleg D, Moore A. X-means: Extending K-means with Efficient Estimation of the Number of Clusters . In:Proceeding of the 17th International Conference on Machine Learning.2000:727-734.

[31] Duda R O, Hart P E. Pattern Classification and Scene Analysis[M]. New York: John Wiley & Sons Inc.,1973.

[32] Katsavounidis I, Kuo C J, Zhang Z. A New Initialization Technique for Generalized Loyd Iteration[J]. IEEE Signal Processing Letters,1994,1(10):144-146.

[33] Khan S S, Ahmad A. Cluster Center Initialization Algorithm for K-means Clustering[J]. Pattern Recognition Letters,2004,25(11):1293-1302.

[34] Redmond S J, Heneghan C. A Method for Initialising the K-means Clustering Algorithm Using Kd-trees[J]. Pattern Recognition Letters,2007,28(8):965-973.

[35] 张文君,顾行发,陈良富,等.基于均值-标准差的K均值初始聚类中心选取算法[J]. 遥感学报 ,2006,10(5):715-721.

[36] 牛琨,张舒博,陈俊亮.融合网格密度的聚类中心初始化方案[J]. 北京邮电大学学报 ,2007,30(2):6-10.

[37] 张文明,吴江,袁小蛟. 基于密度和最近邻的K-means文本聚类算法[J]. 计算机应用 ,2010,30(7):1933-1935.

[38] Metropolis N, Rosenbluth A W, Rosenbluth M N, et al. Equation of State Calculations by Fast Computing Machines[J]. Journal of Chemical Physics, 1953,21(6):1087-1092.

[39] Klein R W, Dubes R C. Experiments in Projection and Clustering by Simulated Annealing[J]. Pattern Recognition,1989,22(2):213-220.

[40] Dong J, Qi M. K-means Optimization Algorithm for Solving Clustering Problem . In:Proceedings of the 2nd International Workshop on Knowledge Discovery and Data Mining,Moscow.2009:52-55.

[41] Bhuyan J N, Raghavan V V, Elayavalli V K. Genetic Algorithm for Clustering with an Ordered Representation .In:Proceedings of the 4th International Conference Genetic Algorithms,San Diego, CA,USA.1991.

[42] Jones D R, Beltramo M A. Solving Partitioning Problems with Genetic Algorithms .In: Proceedings of the 4th International Conference Genetic Algorithms,San Diego,CA,USA.1991:442-494.

[43] Babu G P, Murty N M. A Near-optimal Initial Seed Selection in K-means Algorithm Using a Genetic Algorithm[J]. Pattern Recognit Letters,1993,14(10):763-769.

[44] Krishna K, Murty N M. Genetic K-means Algorithm[J]. IEEE Transactions on Systems, Man and Cybernetics,1999,29(3):433-439.

[45] Maulik U, Bandyopadhyay S. Genetic Algorithm-based Clustering Technique[J].Pattern Recognition,2000,33(9):1455-1465.

[46] Laszlo M, Mukherjee S. A Genetic Algorithm that Exchanges Neighboring Centers for K-means Clustering[J]. Pattern Recognition Letters,2007,28(16):2359-2366.

[47] Bradley P S, Fayyad U M. Refining Initial Points for K-means Clustering . In: Proceedings of the 15th International Conference on Machine Learning.1998:91-99.

[48] Bandyopadhyay S, Maulik U, Pakhira M K. Clustering Using Simulated Annealing with Probabilistic Redistribution[J]. International Journal of Pattern Recognition and Artificial Intelligence,2001,15(2):269-285.

[49] Zalik K R. An Efficient K-means Clustering Algorithm[J].Pattern Recognition Letters, 2008,29(9):1385-1391.

[50] Cao F,Liang J,Jiang G. An Initialization Method for the K-means Algorithm Using Neighborhood Model[J]. Computers & Mathematics with Applications,2009,58(3):474-483.

[51] 刘立平,孟志青. 一种选取初始聚类中心的方法[J]. 计算机工程与应用 ,2004,40(8):179-180.

[52] 王汉芝,刘振全. 一种新的确定均值算法初始聚类中心的方法[J]. 天津科技大学学报 ,2005,20(4):76-79.

[53] 戴文华. 基于遗传算法的文本分类及聚类研究[M]. 北京:科学出版社, 2008.

[54] Ralambondarmy H. A Conceptual Version of the K-means Algorithm[J]. Pattern Recognition Lcttcrs,1995,16(11):1147-1157.

[55] Huang Z X. A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining . In:Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery.1997:1-8.

[56] Huang Z X. Extensions to the K-means Algorithm for Clustering Large Data Sets with Categorical Values[J]. Data Mining and Knowledge Discovery,1998,2(3):283-304.

[57] Chaturvedi A D, Green P E, Carroll J D. K-modes Clustering[J]. Journal of Classification, 2001,18(1):35-55.

[58] Huang Z X, Ng M K. A Note on K-modes Clustering[J]. Journal of Classification,2003,20(2): 257-261.

[59] Sun Y, Zhu Q M, Chen Z X. An Iterative Initial-points Refinement Algorithm for Categorical Data Clustering[J]. Pattern Recognition Letters, 2002,23(7):875-884.

[60] 蒋盛益,李庆华.一种增强的K-means 聚类算法[J]. 计算机工程与科学 ,2006,28(11):56-59.

[61] Roy D K, Sharma L K. Genetic K-means Clustering Algorithm for Mixed Numeric and Categorical Data Sets[J]. International Journal of Artificial Intelligence & Applications,2010,1(2):23-28.
[1] Tingxin Wen,Yangzi Li,Jingshuang Sun. News Hotspots Discovery Method Based on Multi Factor Feature Selection and AFOA/K-means[J]. 数据分析与知识发现, 2019, 3(4): 97-106.
[2] Hongwei Liu,Hongming Gao,Li Chen,Mingjun Zhan,Zhouyang Liang. Identifying User Interests Based on Browsing Behaviors[J]. 数据分析与知识发现, 2018, 2(2): 74-85.
[3] Ding Shengchun,Gong Silan,Li Hongmei. A New Method to Detect Bursty Events from Micro-blog Posts Based on Bursty Topic Words and Agglomerative Hierarchical Clustering Algorithm[J]. 现代图书情报技术, 2016, 32(7-8): 12-20.
[4] Zhao Hui, Liu Huailiang. Research on Short Text Clustering Algorithm for User Generated Content[J]. 现代图书情报技术, 2013, 29(9): 88-92.
[5] He Wenjing, He Lin. Research on Text Clustering Based on Social Tagging[J]. 现代图书情报技术, 2013, 29(7/8): 49-54.
[6] Wang Dongbo, Han Pu, Shen Si, Wei Xiangqing. Research of Mining the Category Knowledge Based on English-Chinese Humanities and Social Sciences Parallel Corpus in Phrase Level[J]. 现代图书情报技术, 2012, (11): 40-46.
[7] Li Ying Wu Yuanyuan Ning Fujin. Application of a Modified K-means Clustering Algorithm Based on PSO in Customer Segmentation of Securities Industry[J]. 现代图书情报技术, 2010, 26(7/8): 88-94.
[8] Wang Jiandong. Domestic Information Services Research Concept Network Analysis Based on Complex Network Method[J]. 现代图书情报技术, 2009, (10): 56-61.
[9] Cao Gaohui,Jiao Yuying,Cheng Quan. Research on Tag Cluster Based on Hierarchical Agglomerative Clustering Algorithm[J]. 现代图书情报技术, 2008, 24(4): 23-28.
[10] Ji Yonghui. A Research on Retrieval Results Clustering and Relevant Recommendation in Digital Library[J]. 现代图书情报技术, 2008, 24(2): 69-75.
  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938   E-mail:jishu@mail.las.ac.cn