New Technology of Library and Information Service  2011, Vol. 27 Issue (9): 34-40    DOI: 10.11925/infotech.1003-3513.2011.09.06
An Improved Method for Determining Optimal Number of Clusters in K-means Clustering Algorithm
Bian Peng1,2, Zhao Yan3, Su Yuzhao1,2
1. National Science Library, Chinese Academy of Sciences, Beijing 100190, China;
2. Graduate University of Chinese Academy of Sciences, Beijing 100049, China;
3. Computer Science and Application Department, Zhengzhou Institute of Aeronautical Industry Management, Zhengzhou 450015, China
Abstract  Motivated by the text clustering requirements of the embedded NSTL recommender system, this paper studies the BWP algorithm and analyzes its shortcomings. An improved algorithm is then proposed that optimizes the calculation of the within-cluster distance for clusters containing a single sample. Building on BWP, the improved algorithm widens the range of candidate cluster numbers and replaces the local optimum with a global optimum. Experimental results show that the method is effective and efficient.
Key words: K-means cluster; Cluster number; Text clustering; Recommending system
Received: 12 July 2011      Published: 02 December 2011
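
The abstract does not reproduce any formulas, so the following is only a minimal Python sketch of the BWP (between-within proportion) index as it is usually defined: for each sample, b is the smallest average distance to the samples of any other cluster, w is the average distance to the other samples of its own cluster, and the score is (b - w) / (b + w), averaged over all samples. The function name `avg_bwp`, the toy data, and in particular the handling of single-sample clusters (w is simply set to 0) are assumptions made for illustration; they are not taken from the paper and do not represent the authors' improved method.

```python
import math


def avg_bwp(points, labels):
    """Average BWP score of a clustering; a higher value suggests a better partition."""
    # Group sample indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("BWP needs at least two clusters")

    def mean_dist(i, members):
        # Average Euclidean distance from sample i to the given samples.
        return sum(math.dist(points[i], points[m]) for m in members) / len(members)

    total = 0.0
    for i, lab in enumerate(labels):
        # b: smallest average distance from sample i to any other cluster.
        b = min(mean_dist(i, mem) for other, mem in clusters.items() if other != lab)
        same = [m for m in clusters[lab] if m != i]
        # ASSUMPTION (not from the paper): a single-sample cluster has no
        # other members, so w would divide by zero; use 0 as a placeholder.
        w = mean_dist(i, same) if same else 0.0
        total += (b - w) / (b + w) if (b + w) > 0 else 0.0
    return total / len(points)


# Toy data: two well-separated pairs of points.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = avg_bwp(points, [0, 0, 1, 1])  # pairs grouped by proximity
bad = avg_bwp(points, [0, 1, 0, 1])   # pairs split across clusters
print(good > bad)
```

To choose a cluster number with such an index, one would typically run K-means for each candidate k, score each result, and keep the k with the highest average score. The `same` branch marks the single-sample case that the abstract identifies as the weak point of the original BWP.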
CLC Number: TP18; G350

 

Cite this article:

Bian Peng, Zhao Yan, Su Yuzhao. An Improved Method for Determining Optimal Number of Clusters in K-means Clustering Algorithm. New Technology of Library and Information Service, 2011, 27(9): 34-40.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2011.09.06     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2011/V27/I9/34
