Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (6): 71-83     https://doi.org/10.11925/infotech.2096-3467.2021.1212
Research Paper
SSVAE: A Deep Variational Text Clustering Model with Semantic Supplementation
Xue Jingjing,Qin Yongbin(),Huang Ruizhang,Ren Lina,Chen Yanping
College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
State Key Laboratory of Public Big Data, College of Computer Science and Technology,Guizhou University, Guiyang 550025, China
Abstract

[Objective] This paper addresses the missing-semantics problem faced by existing deep variational inference algorithms in document clustering. [Methods] Building on the existing deep variational inference algorithm, we propose a Semantic Supplemented Variational Text Clustering Model (SSVAE), which adds text semantic information to the clustering process. [Results] SSVAE effectively supplements the semantic information missing from text clustering. Compared with the best existing deep variational inference model and mainstream deep clustering models, SSVAE improves NMI on the BBC, Reuters-1500, Abstract, Reuters-10k, and 20news-l datasets by 8.92, 7.43, 8.73, 4.80, and 6.14 percentage points, respectively. [Limitations] While supplementing missing semantics, SSVAE inevitably introduces some noise, which can slightly degrade clustering performance. [Conclusions] SSVAE partitions texts more effectively and improves clustering accuracy.

Keywords: Text Clustering; Semantic Loss; Semantic Supplementation; Deep Variational Inference
Received: 2021-10-24      Published: 2022-07-28
CLC number: TP391
Funding: *Key Program of the Joint Funds of the National Natural Science Foundation of China (U1836205); National Natural Science Foundation of China (62066007, 62066008)
Corresponding author: Qin Yongbin, ORCID: 0000-0002-1960-8628, E-mail: ybqin@foxmail.com
Cite this article:
Xue Jingjing, Qin Yongbin, Huang Ruizhang, Ren Lina, Chen Yanping. SSVAE: A Deep Variational Text Clustering Model with Semantic Supplementation. Data Analysis and Knowledge Discovery, 2022, 6(6): 71-83.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.1212      或      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I6/71
Fig. 1  Overview of the SSVAE model
Fig. 2  Probabilistic graphical model of VAE
Fig. 3  Probabilistic graphical model of the semantic supplementation module
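The VAE of Fig. 2 rests on two standard pieces: the reparameterization trick for sampling the latent code, and a closed-form KL term pulling the approximate posterior toward a standard normal. A minimal stdlib-Python sketch of just those two pieces (the helper names are ours, not the paper's):

```python
import math
import random

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    Drawing the noise separately keeps z differentiable with respect
    to mu and logvar, which is what lets a VAE train its encoder by
    backpropagation.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))

rng = random.Random(0)
z = reparameterize([0.0, 1.0], [0.0, 0.0], rng)      # one 2-dim latent sample
kl = kl_to_standard_normal([0.0, 1.0], [0.0, 0.0])   # 0.5 * (0 + 1) = 0.5
```

The KL term vanishes exactly when mu = 0 and logvar = 0, i.e. when the posterior already equals the prior.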
Dataset        Samples   Vocabulary size   Input dimension   Clusters
BBC              2,250            10,000            10,000          5
Reuters-1500     1,500             2,000             2,000          3
Abstract         4,306            10,000            10,000          3
Reuters-10k     10,000             2,000             2,000          4
20news-l         7,025             1,000             1,000         20
Table 1  Dataset statistics
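Table 1 characterizes each corpus by a fixed vocabulary size that equals the model's bag-of-words input dimension. A minimal sketch of that kind of preprocessing, truncating the vocabulary to the most frequent tokens (function names are ours, for illustration only):

```python
from collections import Counter

def build_vocab(docs, vocab_size):
    # Keep only the vocab_size most frequent tokens, mirroring the
    # fixed "vocabulary size" column of Table 1.
    freq = Counter(tok for doc in docs for tok in doc.split())
    return {tok: i for i, (tok, _) in enumerate(freq.most_common(vocab_size))}

def vectorize(doc, vocab):
    # Bag-of-words count vector; its length is the model's input dimension.
    vec = [0] * len(vocab)
    for tok in doc.split():
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec

docs = ["oscar award movi oscar", "match leagu club match match"]
vocab = build_vocab(docs, vocab_size=4)
x = vectorize(docs[0], vocab)  # 4-dim count vector for the first document
```

Out-of-vocabulary tokens are simply dropped, which is why the input dimension stays fixed per dataset regardless of document length.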
Dataset       Metric/%   K-Means   AE      DEC     VAE     VGAE    SSVAE(k=5)  SSVAE(k=20)  SSVAE(k=50)
BBC           Acc        51.58     53.60   70.97   70.23   75.96   80.31       72.63        78.56
              NMI        30.88     39.93   58.62   53.04   54.60   67.54       64.91        65.92
              ARI        20.50     19.90   52.25   43.04   59.91   59.40       54.51        61.36
Reuters-1500  Acc        41.73     54.33   60.73   56.26   60.05   63.60       64.86        71.26
              NMI         4.23     18.81   28.10   26.76   25.51   30.27       31.78        35.53
              ARI         2.28     12.21   20.43   17.40   26.18   24.97       29.02        34.81
Abstract      Acc        69.18     75.56   85.25   85.58   88.27   86.11       91.22        91.69
              NMI        38.26     45.26   57.15   58.73   61.26   57.64       68.72        69.99
              ARI        27.69     39.95   61.02   61.86   67.49   62.25       75.26        76.42
Reuters-10k   Acc        54.04     55.46   74.90   73.58   75.43   78.65       74.80        75.43
              NMI        41.50     25.27   49.69   47.50   50.28   51.17       55.08        46.79
              ARI        27.95     26.07   49.55   48.44   51.26   52.89       54.70        50.83
20news-l      Acc        12.14     14.12   22.99   12.44   18.59   24.75       23.73        24.52
              NMI        11.34     15.12   26.05   10.25   19.63   32.19       31.37        28.77
              ARI         0.30      1.20    8.43    3.59    6.14   12.25        9.38         8.62
Table 2  Model performance comparison
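The NMI scores in Table 2 normalize the mutual information between predicted clusters and gold labels so that 1 means identical partitions and 0 means independence. A from-scratch sketch using one common normalization, the arithmetic mean of the two entropies (the paper may use a different variant, e.g. the geometric mean):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_info(u, v):
    n = len(u)
    joint = Counter(zip(u, v))
    pu, pv = Counter(u), Counter(v)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log( p(a,b) / (p(a) * p(b)) ), rewritten over raw counts
        mi += (c / n) * math.log(c * n / (pu[a] * pv[b]))
    return mi

def nmi(u, v):
    h_u, h_v = entropy(u), entropy(v)
    if h_u == 0 or h_v == 0:
        return 0.0  # a single-cluster partition carries no information
    return mutual_info(u, v) / ((h_u + h_v) / 2)

labels_true = [0, 0, 1, 1]
labels_pred = [1, 1, 0, 0]             # same partition, different cluster ids
score = nmi(labels_true, labels_pred)  # -> 1.0
```

Because NMI only compares partitions, it is invariant to relabeling the clusters, which is why it is the headline metric for unsupervised clustering here.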
Fig. 4  Visualization of text semantic representations
Category       Model   Feature words
Entertainment  SSVAE   Kinsei, Leonardo, Scorses, Dicaprio, Babi, Nomine, Academi, Theatr, Art, Aviat, Angel, Pictur, Favourit, Oscar, Ceremoni, Hollywood, Nomin, Movi, Award, Actor, Vote, Star, Present, Held, Best, Martin, Director, Februari, Million, Battl
Entertainment  VAE     Depp, Alfr, Scientif, Jude, Neverland, Johnni, Kinsei, Leonardo, Dicaprio, Scorses, Babi, Closer, Aviat, Academi, Art, Ceremoni, Theatr, Oscar, Box, Nomin, 27, Movi, Tip, Favourit, Star, 2005, Martin, Award, Member, List
Politics       SSVAE   Fee, Jackson, Constitu, Leadership, 1997, Serv, Howard, Iraq, Blair, Mp, Criticis, Conserv, Labour, Elect, War, Toni, Committe, Stand, Parti, Noth, Campaign, Public, Minist, Prime, Govern, Michael, Member, Issu, Spokesman, View
Politics       VAE     Defect, Chosen, Jackson, Servant, Constitu, Fee, 1997, Letter, Howard, Wrote, Serv, Amid, Blair, Prime, Mp, Toni, Conserv, Michael, Elect, War, Labour, Stand, Spokesman, Public, Member, Iraq, Criticis, Leader, Given, Parti
Sport          SSVAE   Clash, Anfield, Liverpool, Break, Leagu, Readi, English, Certainli, Coach, Champion, Club, Footbal, Stai, Match, Have, Side, Insist, Sport, England, Team, Cup, Chanc, Difficult, Season, Lot, Game, Far, Feel, Premiership, Plai
Sport          VAE     Relish, Enthusiasm, Leverkusen, Anfield, Hodgson, Natur, Readi, Gave, Liverpool, Recov, Bigger, Break, Rest, Explain, Leagu, Ensur, Champion, Footbal, English, Difficult, Club, Todai, Match, England, Europ, Far, Talk, Idea, Chanc, Side
Table 3  Feature words of reconstructed texts
Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: 33 North Fourth Ring Road West, Zhongguancun, Haidian District, Beijing 100190, China
Tel/Fax: (010) 82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn