An Improved Method for Determining Optimal Number of Clusters in K-means Clustering Algorithm

doi:10.11925/infotech.1003-3513.2011.09.06

New Technology of Library and Information Service (现代图书情报技术)

2011, Vol. 27, Issue (9): 34-40 https://doi.org/10.11925/infotech.1003-3513.2011.09.06

Section: Knowledge Organization and Knowledge Management

Bian Peng (边鹏)^1,2, Zhao Yan (赵妍)^3, Su Yuzhao (苏玉召)^1,2

1. National Science Library, Chinese Academy of Sciences, Beijing 100190, China;
2. Graduate University of Chinese Academy of Sciences, Beijing 100049, China;
3. Computer Science and Application Department, Zhengzhou Institute of Aeronautical Industry Management, Zhengzhou 450015, China

Abstract: Motivated by the text clustering requirements of the embedded NSTL personalized recommendation system, this paper studies the BWP method and analyzes its shortcomings, then proposes an improved method for determining the optimal number of clusters in the K-means algorithm. The method optimizes how the within-cluster distance is calculated for a cluster that contains a single sample. This extends the range of cluster numbers to which BWP can be applied, so that a number of clusters that was previously only locally optimal becomes globally optimal. Experimental results show that the method performs well.
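The abstract does not give the BWP formulas, so the following is only a minimal sketch under stated assumptions. It assumes the commonly stated form of the Between-Within Proportion index: for each sample, `b` is the smallest mean squared distance to the members of another cluster, `w` is the mean squared distance to the other members of its own cluster, and the score is `(b - w) / (b + w)`. For a single-sample cluster `w` would divide by zero, which is the case the paper addresses. The paper's actual rule is not stated here, so the sketch swaps in the silhouette convention of giving such a sample a fixed score (`singleton_score`, a hypothetical parameter). The names `avg_bwp` and `choose_k` are illustrative, not the authors'.

```python
import numpy as np

def avg_bwp(X, labels, singleton_score=0.0):
    """Average BWP index of one clustering (assumed definition, see text)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    d2 = (diff ** 2).sum(axis=2)
    clusters = np.unique(labels)
    if len(clusters) < 2:
        raise ValueError("BWP needs at least two clusters")
    scores = np.empty(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            # Single-sample cluster: the within-cluster mean is undefined.
            # Assumption: use a fixed score, as the silhouette index does.
            scores[i] = singleton_score
            continue
        # Between-cluster distance: smallest mean distance to another cluster.
        b = min(d2[i, labels == c].mean() for c in clusters if c != labels[i])
        # Within-cluster distance: mean distance to the other members
        # (d2[i, i] is zero, so it does not affect the sum).
        w = d2[i, own].sum() / (n_own - 1)
        scores[i] = (b - w) / (b + w) if (b + w) > 0 else 0.0
    return scores.mean()

def choose_k(X, labelings):
    """Return the k whose labeling has the highest average BWP.

    `labelings` maps each candidate k to the cluster labels produced
    by running K-means (or any other clustering) with that k.
    """
    return max(labelings, key=lambda k: avg_bwp(X, labelings[k]))

# Two tight pairs of points; k = 3 splits one pair into two singletons.
X = [[0, 0], [0, 1], [10, 0], [10, 1]]
labelings = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
best_k = choose_k(X, labelings)
```

Because single-sample clusters receive a defined score instead of raising a division error, candidate values of k that produce such clusters can still be compared, which is the kind of extended search range the abstract describes. A different fallback rule would change which k wins.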

Keywords: K-means clustering; Number of clusters; Text clustering; Recommender system

Received: 2011-07-12 Published: 2011-12-02

CLC number: TP18; G350

Cite this article:

Bian Peng, Zhao Yan, Su Yuzhao. An Improved Method for Determining Optimal Number of Clusters in K-means Clustering Algorithm[J]. New Technology of Library and Information Service, 2011, 27(9): 34-40.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2011.09.06 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2011/V27/I9/34
