Data Analysis and Knowledge Discovery, 2019, Vol. 3, Issue 9: 27-35    DOI: 10.11925/infotech.2096-3467.2018.1259
Research Paper
Determining Best Text Clustering Number with Mean Shift Algorithm *
Huaming Zhao, Li Yu, Qiang Zhou
National Science Library, Chinese Academy of Sciences, Beijing 100190, China
Abstract

[Objective] This paper explores a method for determining the optimal number of text clusters, aiming to improve the effectiveness and quality of text clustering algorithms. [Methods] We combined the TF-IDF and Word2Vec algorithms to extract the Top N keyword vectors as the feature representation of the texts in the corpus. We then determined the optimal number of text clusters with the mean shift algorithm, a clustering validity index (Silhouette), and the mean squared error (MSE). [Results] The top 4500 keyword vectors represented the text features well. The optimal number of clusters determined with the mean shift algorithm matched the number obtained through manual expert optimization. [Limitations] The experimental data sets are not sufficiently large, and comparisons with applications in other domains are lacking. [Conclusions] The proposed method can determine the number of text clusters with high quality in an unsupervised way.

Keywords: Mean Shift; Text Clustering; Number of Clusters; Clustering Validity
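
The abstract names mean shift as the tool for finding the number of clusters without supervision. The following is a minimal sketch of that idea, not the authors' implementation: it uses a flat kernel, a hand-picked `bandwidth`, and toy 2-D points standing in for keyword vectors, all of which are assumptions made for illustration.

```python
import math

def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift. Returns (modes, labels)."""
    def shift(p):
        # Move p to the mean of all points within `bandwidth` of it.
        near = [q for q in points if math.dist(p, q) <= bandwidth]
        if not near:  # defensive guard; should not occur for a flat kernel
            return p
        return tuple(sum(c) / len(near) for c in zip(*near))

    modes, labels = [], []
    for p in points:
        current = tuple(p)
        for _ in range(max_iter):
            nxt = shift(current)
            moved = math.dist(current, nxt)
            current = nxt
            if moved < tol:
                break
        # Points that converge close to an existing mode share its label.
        for idx, mode in enumerate(modes):
            if math.dist(current, mode) <= bandwidth / 2:
                labels.append(idx)
                break
        else:
            modes.append(current)
            labels.append(len(modes) - 1)
    return modes, labels

# Toy stand-ins for document vectors (hypothetical data).
docs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
        (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
modes, labels = mean_shift(docs, bandwidth=1.0)
k = len(modes)  # the number of clusters is read off the number of modes
print(k, labels)
```

The number of clusters is an output here rather than an input, which is what makes mean shift usable for choosing K. The result still depends on the bandwidth, so in practice several bandwidths would be compared with a validity index.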
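Received: 2018-11-13
CLC number: G20 G35
Funding: *This work was supported by the National Social Science Fund of China project "Research on Deep Integration and Disclosure of Resources Based on Open Access Academic Journals" (Grant No. 16BTQ025) and by the National Science Library, Chinese Academy of Sciences, Special Project for Literature and Information Capacity Building, "Construction of a Literature and Information 'Data Lake' and Open Big Data Framework" (Grant No. 院1852).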
Cite this article:
Huaming Zhao, Li Yu, Qiang Zhou. Determining Best Text Clustering Number with Mean Shift Algorithm[J]. Data Analysis and Knowledge Discovery, 2019, 3(9): 27-35. DOI: 10.11925/infotech.2096-3467.2018.1259.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.1259
Figure 1  Line chart of the Silhouette index for Top N text clustering based on the mean shift algorithm
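| Top N | K (q=0.09) | Sil (q=0.09) | K (q=0.07) | Sil (q=0.07) | K (q=0.06) | Sil (q=0.06) | K (q=0.03) | Sil (q=0.03) | K (q=0.02) | Sil (q=0.02) | K (q=0.01) | Sil (q=0.01) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | 7 | 0.445 | 11 | 0.428 | 17 | 0.412 | 30 | 0.415 | 44 | 0.420 | 105 | 0.425 |
| 2000 | 6 | 0.427 | 12 | 0.419 | 17 | 0.414 | 27 | 0.425 | 43 | 0.420 | 90 | 0.433 |
| 3000 | 6 | 0.432 | 11 | 0.413 | 14 | 0.423 | 27 | 0.440 | 43 | 0.432 | 92 | 0.431 |
| 4000 | 8 | 0.411 | 11 | 0.415 | 15 | 0.407 | 27 | 0.442 | 36 | 0.437 | 96 | 0.396 |
| 5000 | 7 | 0.449 | 11 | 0.413 | 15 | 0.426 | 24 | 0.439 | 35 | 0.429 | 82 | 0.400 |
| 6000 | 7 | 0.432 | 9 | 0.429 | 14 | 0.415 | 26 | 0.428 | 33 | 0.427 | 76 | 0.396 |

Table 1  Text clustering based on the mean shift algorithm (K: number of clusters; Sil: Silhouette index)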
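
The Sil values in the table above are Silhouette scores. As a rough illustration of what such a score measures, the sketch below computes the standard mean silhouette coefficient with Euclidean distance. The page does not show how the authors computed it, so the distance choice and the toy data are assumptions.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes s = 0
        # a: mean distance to the other members of p's own cluster
        # (p itself adds 0 to the sum, hence the division by len - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)

# Hypothetical toy data: two tight, well-separated groups.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
good = silhouette_score(pts, [0, 0, 0, 0, 1, 1, 1, 1])
bad = silhouette_score(pts, [0, 1, 0, 1, 0, 1, 0, 1])
print(round(good, 3), round(bad, 3))
```

Scores lie in [-1, 1]; values near 1 indicate compact, well-separated clusters, which is why the index can be compared across candidate values of K.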
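| Top N | K (p=none) | Sil (p=none) | Time (s) (p=none) | K (p=-50) | Sil (p=-50) | Time (s) (p=-50) | K (p=-100) | Sil (p=-100) | Time (s) (p=-100) | K (p=-1000) | Sil (p=-1000) | Time (s) (p=-1000) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | 23 | 0.419 | 2.59 | 129 | 0.428 | 0.96 | 93 | 0.428 | 1.43 | 52 | 0.416 | 4.04 |
| 2000 | 188 | 0.424 | 13.64 | 184 | 0.435 | 4.14 | 137 | 0.432 | 6.83 | 89 | 0.429 | 31.59 |
| 3000 | 977 | 0.500 | 29.97 | 226 | 0.415 | 9.55 | 170 | 0.417 | 19.34 | 210 | 0.414 | 71.10 |
| 4000 | 1617 | 0.502 | 53.70 | 267 | 0.409 | 18.05 | 198 | 0.401 | 43.51 | 992 | 0.491 | 126.03 |
| 5000 | 2582 | 0.457 | 85.28 | 311 | 0.406 | 22.14 | 224 | 0.399 | 80.41 | 1912 | 0.499 | 197.15 |
| 6000 | 2546 | 0.490 | 286.17 | 346 | 0.396 | 33.01 | 268 | 0.391 | 282.23 | 1846 | 0.500 | 285.15 |

Table 2  Text clustering based on the AP algorithm (K: number of clusters; Sil: Silhouette index)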
Figure 2  Line chart of text clustering and mean squared error for different search ranges
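
The mean squared error in this figure is the second indicator named in the abstract. The page does not give its exact definition, so the sketch below assumes one common choice: the mean squared Euclidean distance from each point to the centroid of its assigned cluster. The data are hypothetical.

```python
def cluster_mse(points, labels):
    """Mean squared Euclidean distance from each point to its cluster centroid."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    # Centroid of each cluster: coordinate-wise mean of its members.
    centroids = {lab: tuple(sum(c) / len(g) for c in zip(*g))
                 for lab, g in groups.items()}
    squared = sum(sum((x - m) ** 2 for x, m in zip(p, centroids[lab]))
                  for p, lab in zip(points, labels))
    return squared / len(points)

# Hypothetical 2-D points: two pairs far apart along the x axis.
pts = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
two_clusters = cluster_mse(pts, [0, 0, 1, 1])
one_cluster = cluster_mse(pts, [0, 0, 0, 0])
print(two_clusters, one_cluster)
```

Under this definition the error can only stay equal or fall as clusters are split further, so it is usually read together with a validity index such as Silhouette rather than minimized on its own.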
Figure 3  Line chart of the Silhouette index against the number of text clusters