SSVAE: A Deep Variational Text Clustering Model with Semantic Supplementation
Xue Jingjing, Qin Yongbin, Huang Ruizhang, Ren Lina, Chen Yanping
College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
Abstract [Objective] This paper addresses the loss of semantic information in existing document clustering algorithms. [Methods] Building on the conventional deep variational inference algorithm, we propose a Semantic Supplemented Variational Text Clustering Model (SSVAE), which incorporates textual semantic information into the clustering process. [Results] SSVAE effectively mitigates the missing-semantics problem. Compared with the best existing models, SSVAE improves normalized mutual information (NMI) on the BBC, Reuters-1500, Abstract, Reuters-10k, and 20news-l datasets by 8.92%, 7.43%, 8.73%, 4.80%, and 6.14%, respectively. [Limitations] Semantic supplementation inevitably introduces some noise, which affects clustering performance. [Conclusions] The SSVAE model effectively improves the accuracy of text clustering.
Received: 24 October 2021
Published: 28 July 2022
Fund: National Natural Science Foundation of China (U1836205); National Natural Science Foundation of China (62066007); National Natural Science Foundation of China (62066008)
Corresponding Author: Qin Yongbin, E-mail: ybqin@foxmail.com