College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
[Objective] This paper addresses the loss of semantic information in existing document clustering algorithms. [Methods] Building on the traditional deep variational inference algorithm, we propose a Semantic Supplemented Variational Text Clustering Model (SSVAE), which adds text semantic information to the clustering process. [Results] The SSVAE effectively addressed the missing-semantics issue. Compared with the best existing models, SSVAE improved NMI on the BBC, Reuters-1500, Abstract, Reuters-10k, and 20news-l datasets by 8.92%, 7.43%, 8.73%, 4.80%, and 6.14%, respectively. [Limitations] The semantic supplementation process inevitably introduces some noise, which affects clustering performance. [Conclusions] The new SSVAE model effectively improves the accuracy of text clustering.
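The results above are reported as NMI (normalized mutual information) between predicted clusters and reference labels. As an illustration only, here is a minimal Python sketch of one common NMI definition, 2·I(U;V) / (H(U) + H(V)). The abstract does not say which normalization the authors used, so the arithmetic-mean form is an assumption, and the function name `normalized_mutual_information` is hypothetical.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_true, labels_pred):
    """NMI with arithmetic-mean normalization: 2*I(U;V) / (H(U) + H(V)).

    Assumption: the abstract does not specify the normalization variant;
    this is one widely used choice, not necessarily the authors'.
    """
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("label sequences must be non-empty and of equal length")

    count_true = Counter(labels_true)                    # cluster sizes in U
    count_pred = Counter(labels_pred)                    # cluster sizes in V
    count_joint = Counter(zip(labels_true, labels_pred)) # co-occurrence counts

    def entropy(counts):
        # Shannon entropy of the empirical label distribution
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # I(U;V) = sum_{u,v} p(u,v) * log( p(u,v) / (p(u) * p(v)) )
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_true[u] * count_pred[v]))
        for (u, v), c in count_joint.items()
    )

    h_true, h_pred = entropy(count_true), entropy(count_pred)
    if h_true + h_pred == 0.0:
        # Both partitions put everything in a single cluster: treat as identical
        return 1.0
    return 2.0 * mutual_info / (h_true + h_pred)


# NMI ignores how clusters are named: a relabelled but identical partition scores 1
same_partition = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
# Statistically independent partitions score 0
independent = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Because NMI depends only on the co-occurrence counts of labels, it can compare clusterings whose cluster identifiers do not match the reference class names, which is why it is a common choice for evaluating unsupervised text clustering.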