Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (6): 71-83    DOI: 10.11925/infotech.2096-3467.2021.1212
SSVAE: A Deep Variational Text Clustering Model with Semantic Supplementation
Xue Jingjing, Qin Yongbin, Huang Ruizhang, Ren Lina, Chen Yanping
College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
Abstract  

[Objective] This paper addresses the missing-semantics problem of existing document clustering algorithms. [Methods] Building on the traditional deep variational inference algorithm, we propose a Semantic Supplemented Variational Text Clustering Model (SSVAE), which incorporates text semantic information into the clustering process. [Results] SSVAE effectively addresses the missing-semantics issue. Compared with the best existing models, its NMI on the BBC, Reuters-1500, Abstract, Reuters-10k, and 20news-l datasets improved by 8.92%, 7.43%, 8.73%, 4.80%, and 6.14%, respectively. [Limitations] The semantic supplementation process inevitably introduces some noise, which affects clustering performance. [Conclusions] The new SSVAE model effectively improves the accuracy of text clustering.
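The pipeline the abstract describes — encoding each document into a low-dimensional latent vector and then grouping those vectors into clusters — can be illustrated with a plain k-means pass over synthetic latent representations. This is an illustrative sketch only, not the paper's implementation; the dimensions, cluster counts, and the use of Lloyd's algorithm are all assumptions standing in for the model's actual encoder and clustering head:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent representations: 60 "documents" in a 10-dim latent
# space, drawn from three well-separated Gaussians to stand in for the
# output of a trained variational encoder.
centers = rng.normal(scale=5.0, size=(3, 10))
z = np.vstack([c + rng.normal(size=(20, 10)) for c in centers])

def kmeans(z, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on latent vectors z (n x d)."""
    rng = np.random.default_rng(seed)
    cent = z[rng.choice(len(z), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean).
        d = ((z[:, None, :] - cent[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # Move each centroid to the mean of its points; keep it if empty.
        cent = np.array([z[labels == j].mean(0) if np.any(labels == j)
                         else cent[j] for j in range(k)])
    return labels

labels = kmeans(z, k=3)
print(labels[:5])
```

In the actual model, the latent vectors would come from the variational encoder rather than a Gaussian simulation, and the clustering objective is trained jointly rather than applied post hoc.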

Key words: Text Clustering; Semantic Loss; Semantic Supplementation; Deep Variational Inference
Received: 24 October 2021      Published: 28 July 2022
CLC Number: TP391
Fund: National Natural Science Foundation of China (U1836205, 62066007, 62066008)
Corresponding Authors: Qin Yongbin     E-mail: ybqin@foxmail.com

Cite this article:

Xue Jingjing, Qin Yongbin, Huang Ruizhang, Ren Lina, Chen Yanping. SSVAE: A Deep Variational Text Clustering Model with Semantic Supplementation. Data Analysis and Knowledge Discovery, 2022, 6(6): 71-83.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.1212     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I6/71

Schematic Diagram of SSVAE
The Probabilistic Graphical Model of the VAE
The Probabilistic Graphical Model of the Semantic Supplement Module
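The VAE probabilistic graphical model referenced above bottoms out in the reparameterization trick of Kingma and Welling [1]: the encoder outputs a mean and log-variance, and the latent code is sampled as z = mu + sigma * eps with eps drawn from a standard normal, so gradients can flow through the sampling step. A minimal numpy sketch of that step and the closed-form KL term (illustrative only; shapes and names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

def reparameterize(mu, logvar, rng):
    """Sample z ~ N(mu, diag(exp(logvar))) via z = mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

mu = np.zeros((4, 8))       # batch of 4 documents, latent dim 8
logvar = np.zeros((4, 8))   # sigma = 1 everywhere, i.e. q equals the prior
z = reparameterize(mu, logvar, rng)
print(kl_to_standard_normal(mu, logvar))  # all zeros: no divergence from prior
```

In training, this KL term is added to the reconstruction loss to form the evidence lower bound that the encoder and decoder jointly optimize.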
Dataset        Samples   Vocabulary Size   Input Dimension   Clusters
BBC            2,250     10,000            10,000            5
Reuters-1500   1,500     2,000             2,000             3
Abstract       4,306     10,000            10,000            3
Reuters-10k    10,000    2,000             2,000             4
20news-l       7,025     1,000             1,000             20
Statistics of the Datasets
Dataset        Metric/%   K-Means   AE      DEC     VAE     VGAE    SSVAE(k=5)   SSVAE(k=20)   SSVAE(k=50)
BBC            Acc        51.58     53.60   70.97   70.23   75.96   80.31        72.63         78.56
               NMI        30.88     39.93   58.62   53.04   54.60   67.54        64.91         65.92
               ARI        20.50     19.90   52.25   43.04   59.91   59.40        54.51         61.36
Reuters-1500   Acc        41.73     54.33   60.73   56.26   60.05   63.60        64.86         71.26
               NMI        4.23      18.81   28.10   26.76   25.51   30.27        31.78         35.53
               ARI        2.28      12.21   20.43   17.40   26.18   24.97        29.02         34.81
Abstract       Acc        69.18     75.56   85.25   85.58   88.27   86.11        91.22         91.69
               NMI        38.26     45.26   57.15   58.73   61.26   57.64        68.72         69.99
               ARI        27.69     39.95   61.02   61.86   67.49   62.25        75.26         76.42
Reuters-10k    Acc        54.04     55.46   74.90   73.58   75.43   78.65        74.80         75.43
               NMI        41.50     25.27   49.69   47.50   50.28   51.17        55.08         46.79
               ARI        27.95     26.07   49.55   48.44   51.26   52.89        54.70         50.83
20news-l       Acc        12.14     14.12   22.99   12.44   18.59   24.75        23.73         24.52
               NMI        11.34     15.12   26.05   10.25   19.63   32.19        31.37         28.77
               ARI        0.30      1.20    8.43    3.59    6.14    12.25        9.38          8.62
Performance Comparison of Models (K-Means through VGAE are the baseline methods)
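The NMI figures above normalize the mutual information between predicted and true cluster assignments so that a perfect match scores 1 regardless of how cluster labels are permuted. A self-contained numpy version, using the arithmetic-mean normalization (the same default as scikit-learn's `normalized_mutual_info_score`; other normalizations such as the geometric mean exist), might look like:

```python
import numpy as np

def nmi(a, b):
    """Normalized mutual information between two labelings,
    using arithmetic-mean normalization: MI / (0.5 * (H(a) + H(b)))."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    # Contingency table of joint label counts.
    ua, ia = np.unique(a, return_inverse=True)
    ub, ib = np.unique(b, return_inverse=True)
    c = np.zeros((len(ua), len(ub)))
    np.add.at(c, (ia, ib), 1)
    p = c / n
    pa, pb = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    nz = p > 0
    mi = np.sum(p[nz] * np.log(p[nz] / (pa @ pb)[nz]))
    def h(px):
        px = px[px > 0]
        return -np.sum(px * np.log(px))
    denom = 0.5 * (h(pa.ravel()) + h(pb.ravel()))
    return mi / denom if denom > 0 else 0.0

true = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 0, 2, 2]  # same partition, labels permuted
print(round(nmi(true, pred), 4))  # 1.0 — NMI is invariant to label permutation
```

Accuracy (Acc) additionally requires matching predicted cluster labels to true classes, typically via the Hungarian assignment, and ARI corrects the Rand index for chance agreement.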
The Visualization of Text Semantic Representation
Category        Model   Feature Words
Entertainment   SSVAE   Kinsei, Leonardo, Scorses, Dicaprio, Babi, Nomine, Academi, Theatr, Art, Aviat, Angel, Pictur, Favourit, Oscar, Ceremoni, Hollywood, Nomin, Movi, Award, Actor, Vote, Star, Present, Held, Best, Martin, Director, Februari, Million, Battl
                VAE     Depp, Alfr, Scientif, Jude, Neverland, Johnni, Kinsei, Leonardo, Dicaprio, Scorses, Babi, Closer, Aviat, Academi, Art, Ceremoni, Theatr, Oscar, Box, Nomin, 27, Movi, Tip, Favourit, Star, 2005, Martin, Award, Member, List
Politics        SSVAE   Fee, Jackson, Constitu, Leadership, 1997, Serv, Howard, Iraq, Blair, Mp, Criticis, Conserv, Labour, Elect, War, Toni, Committe, Stand, Parti, Noth, Campaign, Public, Minist, Prime, Govern, Michael, Member, Issu, Spokesman, View
                VAE     Defect, Chosen, Jackson, Servant, Constitu, Fee, 1997, Letter, Howard, Wrote, Serv, Amid, Blair, Prime, Mp, Toni, Conserv, Michael, Elect, War, Labour, Stand, Spokesman, Public, Member, Iraq, Criticis, Leader, Given, Parti
Sport           SSVAE   Clash, Anfield, Liverpool, Break, Leagu, Readi, English, Certainli, Coach, Champion, Club, Footbal, Stai, Match, Have, Side, Insist, Sport, England, Team, Cup, Chanc, Difficult, Season, Lot, Game, Far, Feel, Premiership, Plai
                VAE     Relish, Enthusiasm, Leverkusen, Anfield, Hodgson, Natur, Readi, Gave, Liverpool, Recov, Bigger, Break, Rest, Explain, Leagu, Ensur, Champion, Footbal, English, Difficult, Club, Todai, Match, England, Europ, Far, Talk, Idea, Chanc, Side
Keywords of Reconstructed Text Documents (stemmed terms)
[1] Kingma D P, Welling M. Auto-Encoding Variational Bayes[OL]. arXiv Preprint, arXiv:1312.6114.
[2] Miao Y S, Yu L, Blunsom P. Neural Variational Inference for Text Processing[C]// Proceedings of the 33rd International Conference on Machine Learning-Volume 48. 2016: 1727-1736.
[3] Kipf T N, Welling M. Variational Graph Auto-Encoders[OL]. arXiv Preprint, arXiv:1611.07308.
[4] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet Allocation[J]. The Journal of Machine Learning Research, 2003, 3: 993-1022.
[5] Zhang Tao, Ma Haiqun. Clustering Policy Texts Based on LDA Topic Model[J]. Data Analysis and Knowledge Discovery, 2018, 2(9): 59-65. (in Chinese)
[6] Yin J H, Wang J Y. A Dirichlet Multinomial Mixture Model-Based Approach for Short Text Clustering[C]// Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2014: 233-242.
[7] Blei D, Carin L, Dunson D. Probabilistic Topic Models[J]. IEEE Signal Processing Magazine, 2010, 27(6): 55-65.
[8] Hinton G E, Salakhutdinov R R. Reducing the Dimensionality of Data with Neural Networks[J]. Science, 2006, 313(5786): 504-507. DOI: 10.1126/science.1127647.
[9] Xie J Y, Girshick R, Farhadi A. Unsupervised Deep Embedding for Clustering Analysis[C]// Proceedings of the 33rd International Conference on Machine Learning-Volume 48. 2016: 478-487.
[10] Goodfellow I J, Pouget-Abadie J, Mirza M, et al. Generative Adversarial Nets[C]// Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. 2014: 2672-2680.
[11] Li Chongxuan, Zhu Jun, Zhang Bo. Conditional Graphical Generative Adversarial Networks[J]. Journal of Software, 2020, 31(4): 1002-1008. (in Chinese)
[12] Li Feifei, Wu Fan, Wang Zhongqing. Sentiment Analysis with Reviewer Types and Generative Adversarial Network[J]. Data Analysis and Knowledge Discovery, 2021, 5(4): 72-79. (in Chinese)
[13] Liang Xingxing, Feng Yanghe, Huang Jincai, et al. Novel Deep Reinforcement Learning Algorithm Based on Attention-Based Value Function and Autoregressive Environment Model[J]. Journal of Software, 2020, 31(4): 948-966. (in Chinese)
[14] An J, Cho S. Variational Autoencoder Based Anomaly Detection Using Reconstruction Probability[J]. Special Lecture on IE, 2015, 2(1): 1-18.
[15] Higgins I, Matthey L, Pal A, et al. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework[C]// Proceedings of the 5th International Conference on Learning Representations. 2017.
[16] Liu Y, Liu Z Y, Chua T S, et al. Topical Word Embeddings[C]// Proceedings of the 29th AAAI Conference on Artificial Intelligence. 2015: 2418-2424.
[17] Dieng A B, Ruiz F J R, Blei D M. Topic Modeling in Embedding Spaces[J]. Transactions of the Association for Computational Linguistics, 2020, 8: 439-453.
[18] Blei D M, Jordan M I. Variational Inference for Dirichlet Process Mixtures[J]. Bayesian Analysis, 2006, 1(1): 121-143.
[19] Liu L Y, Huang H Y, Gao Y, et al. Neural Variational Correlated Topic Modeling[C]// Proceedings of the World Wide Web Conference. 2019: 1142-1152.
[20] Teh Y W, Jordan M I, Beal M J, et al. Hierarchical Dirichlet Processes[J]. Journal of the American Statistical Association, 2006, 101(476): 1566-1581.
[21] Miao Y S, Grefenstette E, Blunsom P. Discovering Discrete Latent Topics with Neural Variational Inference[C]// Proceedings of the 34th International Conference on Machine Learning-Volume 70. 2017: 2410-2419.
[22] Lee H B, Macqueen J B. A K-Means Cluster Analysis Computer Program with Cross-Tabulations and Next-Nearest-Neighbor Analysis[J]. Educational and Psychological Measurement, 1980, 40(1): 133-138.
[23] Maaten L, Hinton G. Visualizing Data Using t-SNE[J]. Journal of Machine Learning Research, 2008, 9: 2579-2605.
[24] Christian H, Agus M P, Suhartono D. Single Document Automatic Text Summarization Using Term Frequency-Inverse Document Frequency (TF-IDF)[J]. ComTech: Computer, Mathematics and Engineering Applications, 2016, 7(4): 285-294.
[25] Greene D, Cunningham P. Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering[C]// Proceedings of the 23rd International Conference on Machine Learning. 2006: 377-384.
[26] Zhong S. Semi-Supervised Model-Based Document Clustering: A Comparative Study[J]. Machine Learning, 2006, 65(1): 3-29.