[Objective] This paper explores how to determine the optimal number of text clusters, aiming to improve the effectiveness of text clustering algorithms. [Methods] First, we combined the TF-IDF and Word2Vec algorithms to extract the TopN keyword vectors as the text feature representation of the corpus. Then, we determined the optimal number of text clusters with the Mean Shift algorithm, a clustering validity index (Silhouette), and the mean square error (MSE) index. [Results] We found that the top 4500 keyword vectors better represented the text features. The optimal number of clusters obtained with the Mean Shift algorithm matched the manually optimized results. [Limitations] The experimental data sets need to be expanded, and our results should be compared with those of other applications. [Conclusions] The proposed method can effectively determine the optimal number of text clusters in an unsupervised way.
赵华茗,余丽,周强. 基于均值漂移算法的文本聚类数目优化研究[J]. 数据分析与知识发现, 2019, 3(9): 27-35.
Huaming Zhao, Li Yu, Qiang Zhou. Determining Best Text Clustering Number with Mean Shift Algorithm. Data Analysis and Knowledge Discovery, 2019, 3(9): 27-35.
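The second step in the Methods, estimating the number of clusters with Mean Shift and checking it with the Silhouette index, can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a flat kernel in pure Python, a handful of toy 2-D points standing in for keyword vectors, and an assumed bandwidth of 1.0. The function names (`mean_shift_modes`, `assign_labels`, `silhouette`) are hypothetical.

```python
import math

def mean_shift_modes(points, bandwidth, max_iter=100, tol=1e-6):
    """Shift each point to the mean of its neighbours (flat kernel) until it
    stops moving, then merge converged positions closer than `bandwidth`."""
    modes = []
    for start in points:
        current = list(start)
        for _ in range(max_iter):
            # Points inside the flat-kernel window around the current position.
            window = [p for p in points if math.dist(p, current) <= bandwidth]
            if not window:
                break
            shifted = [sum(coord) / len(window) for coord in zip(*window)]
            moved = math.dist(shifted, current)
            current = shifted
            if moved < tol:
                break
        # Register a new mode only if no existing mode is within the bandwidth.
        if not any(math.dist(current, m) < bandwidth for m in modes):
            modes.append(current)
    return modes

def assign_labels(points, modes):
    """Label each point with the index of its nearest mode."""
    return [min(range(len(modes)), key=lambda k: math.dist(p, modes[k]))
            for p in points]

def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        return 0.0
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for k, other in clusters.items() if k != label)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy data (assumption): three well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.2, 4.9), (4.9, 5.1),
          (10.0, 0.0), (10.1, 0.2), (9.8, 0.1)]

modes = mean_shift_modes(points, bandwidth=1.0)  # bandwidth is an assumed value
labels = assign_labels(points, modes)
score = silhouette(points, labels)
print(len(modes), labels, round(score, 3))
```

The number of modes is the estimated number of clusters, so no cluster count has to be supplied in advance. The result does depend on the bandwidth, which is why a validity index such as Silhouette is a useful check on the estimate.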