SCCL Text Deep Clustering with Increased Cluster-Level Comparison

Li Jie 1,2, Zhang Zhixiong 1,2, Wang Yufei 1,2

1 National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2 Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China

Abstract [Objective] This paper proposes ISCCL, a new deep clustering model for texts based on SCCL, aiming to improve its performance on clustering tasks. [Methods] First, ISCCL uses a pre-trained sentence-embedding model to perform data augmentation and encoding, producing two augmented representations of each input text. Second, two nonlinear network layers are added to SCCL, projecting the augmented representations into a cluster feature space whose dimension equals the number of clusters. Third, positive and negative cluster pairs are constructed from the columns of this space for contrastive learning, which guides the model toward features useful for clustering and reduces the impact of false-positive samples. [Results] On five benchmark datasets (AgNews, Biomedical, StackOverflow, 20NewsGroups, and zh10), ISCCL reached clustering accuracies of 88.89%, 48.74%, 78.17%, 56.97%, and 86.42%, respectively, an improvement of 0.69% to 2.67% over the SCCL model. [Limitations] The dimension of the cluster feature space must be preset to the number of clusters K, which is often difficult to determine for the original data and must be tuned per dataset. [Conclusions] The ISCCL model effectively extracts cluster features and improves deep clustering performance on texts.
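The cluster-level contrastive step described above (forming positive and negative cluster pairs from the columns of the two views' soft-assignment matrices) can be illustrated with a minimal NumPy sketch. The function name, normalization, and temperature value below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def cluster_contrastive_loss(p1, p2, temperature=0.5):
    """Column-wise (cluster-level) contrastive loss sketch.

    p1, p2: (batch_size x K) soft cluster-assignment matrices from two
    augmented views of the same batch.  Column k of p1 and column k of p2
    form a positive pair; all remaining columns act as negatives.
    """
    # Treat each column as a cluster representation and L2-normalise it,
    # so the dot product below is cosine similarity (epsilon avoids /0).
    c1 = p1.T / (np.linalg.norm(p1.T, axis=1, keepdims=True) + 1e-12)
    c2 = p2.T / (np.linalg.norm(p2.T, axis=1, keepdims=True) + 1e-12)
    c = np.concatenate([c1, c2], axis=0)          # (2K, batch_size)
    sim = np.exp(c @ c.T / temperature)           # (2K, 2K) similarities
    k = c1.shape[0]
    loss = 0.0
    for i in range(2 * k):
        pos = (i + k) % (2 * k)                   # same column, other view
        denom = sim[i].sum() - sim[i, i]          # exclude self-similarity
        loss += -np.log(sim[i, pos] / denom)      # InfoNCE-style term
    return loss / (2 * k)
```

When the two views agree (identical assignments), the loss is small; permuting the columns of one view breaks the positive pairs and increases it, which is the signal that pushes the cluster features of the two augmentations to align column by column.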
Received: 03 March 2023
Published: 28 April 2023
Fund: National Science and Technology Library Special Project (2023XM42)
Corresponding Author: Zhang Zhixiong, ORCID: 0000-0003-1596-7487, E-mail: zhangzhx@mail.las.ac.cn

[1] Berkhin P. A Survey of Clustering Data Mining Techniques[M]// Grouping Multidimensional Data. Cham: Springer, 2006: 25-72.
[2] MacQueen J B. Some Methods for Classification and Analysis of Multivariate Observations[C]// Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics. 1967: 281-297.
[3] Lloyd S P. Least Squares Quantization in PCM[J]. IEEE Transactions on Information Theory, 1982, 28(2): 129-137. doi: 10.1109/TIT.1982.1056489
[4] von Luxburg U. A Tutorial on Spectral Clustering[J]. Statistics and Computing, 2007, 17(4): 395-416. doi: 10.1007/s11222-007-9033-z
[5] Ester M, Kriegel H P, Sander J, et al. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise[C]// Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96). 1996: 226-231.
[6] Bellman R E. Adaptive Control Processes: A Guided Tour[M]. Princeton, New Jersey: Princeton University Press, 1961.
[7] Shlens J. A Tutorial on Principal Component Analysis[OL]. arXiv Preprint, arXiv:1404.1100.
[8] Xie J Y, Girshick R, Farhadi A. Unsupervised Deep Embedding for Clustering Analysis[C]// Proceedings of the 33rd International Conference on Machine Learning - Volume 48. 2016: 478-487.
[9] Yang J W, Parikh D, Batra D. Joint Unsupervised Learning of Deep Representations and Image Clusters[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016: 5147-5156.
[10] Guo X F, Gao L, Liu X W, et al. Improved Deep Embedded Clustering with Local Structure Preservation[C]// Proceedings of the 26th International Joint Conference on Artificial Intelligence. 2017: 1753-1759.
[11] Wu L R, Liu Z C, Zang Z L, et al. Generalized Clustering and Multi-Manifold Learning with Geometric Structure Preservation[OL]. arXiv Preprint, arXiv:2009.09590v4.
[12] Tian F, Gao B, Cui Q, et al. Learning Deep Representations for Graph Clustering[C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2014. doi: 10.1609/aaai.v28i1.8916
[13] Jiang Z X, Zheng Y, Tan H C, et al. Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering[C]// Proceedings of the 26th International Joint Conference on Artificial Intelligence. 2017: 1965-1972.
[14] Wu Z R, Xiong Y J, Yu S X, et al. Unsupervised Feature Learning via Non-Parametric Instance Discrimination[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018: 3733-3742.
[15] Chen T, Kornblith S, Norouzi M, et al. A Simple Framework for Contrastive Learning of Visual Representations[OL]. arXiv Preprint, arXiv:2002.05709.
[16] Huang J B, Gong S G, Zhu X T. Deep Semantic Clustering by Partition Confidence Maximisation[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 8846-8855.
[17] Li J N, Zhou P, Xiong C M, et al. Prototypical Contrastive Learning of Unsupervised Representations[OL]. arXiv Preprint, arXiv:2005.04966.
[18] Zhang D J, Nan F, Wei X K, et al. Supporting Clustering with Contrastive Learning[OL]. arXiv Preprint, arXiv:2103.12953.
[19] Gomes R, Krause A, Perona P. Discriminative Clustering by Regularized Information Maximization[C]// Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 1. 2010: 775-783.
[20] Li Y F, Yang M X, Peng D Z, et al. Twin Contrastive Learning for Online Clustering[J]. International Journal of Computer Vision, 2022, 130(9): 2205-2221. doi: 10.1007/s11263-022-01639-z
[21] Li Y F, Hu P, Liu Z T, et al. Contrastive Clustering[OL]. arXiv Preprint, arXiv:2009.09687.
[22] Song C F, Liu F, Huang Y Z, et al. Auto-Encoder Based Data Clustering[C]// Proceedings of CIARP 2013: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. 2013: 117-124.
[23] Yang B, Fu X, Sidiropoulos N D, et al. Towards K-Means-Friendly Spaces: Simultaneous Deep Learning and Clustering[C]// Proceedings of the 34th International Conference on Machine Learning - Volume 70. 2017: 3861-3870.
[24] Hadifar A, Sterckx L, Demeester T, et al. A Self-Training Approach for Short Text Clustering[C]// Proceedings of the 4th Workshop on Representation Learning for NLP. 2019: 194-199.
[25] Bo D Y, Wang X, Shi C, et al. Structural Deep Clustering Network[C]// Proceedings of The Web Conference 2020. 2020: 1400-1410.
[26] Hu W H, Miyato T, Tokui S, et al. Learning Discrete Representations via Information Maximizing Self-Augmented Training[OL]. arXiv Preprint, arXiv:1702.08720.
[27] van den Oord A, Li Y Z, Vinyals O. Representation Learning with Contrastive Predictive Coding[OL]. arXiv Preprint, arXiv:1807.03748.
[28] Sohn K. Improved Deep Metric Learning with Multi-Class N-Pair Loss Objective[C]// Proceedings of the 30th International Conference on Neural Information Processing Systems. 2016: 1857-1865.
[29] Yan Y M, Li R M, Wang S R, et al. ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer[C]// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021: 5065-5075.
[30] Zhang Junlin. How to Use Noisy Data: The Application of Contrastive Learning in Weibo Scenarios[EB/OL]. [2023-03-06]. https://mp.weixin.qq.com/s/9N2tk6QTCTuBkrU5Xwb2ow.
[31] Gao T Y, Yao X C, Chen D Q. SimCSE: Simple Contrastive Learning of Sentence Embeddings[OL]. arXiv Preprint, arXiv:2104.08821.
[32] Kobayashi S. Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations[OL]. arXiv Preprint, arXiv:1805.06201.
[33] Ma E. NLP Augmentation[EB/OL]. [2023-03-12]. https://github.com/makcedward/nlpaug.
[34] Vaswani A, Shazeer N, Parmar N, et al. Attention is All You Need[OL]. arXiv Preprint, arXiv:1706.03762.
[35] Srivastava N, Hinton G, Krizhevsky A, et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting[J]. Journal of Machine Learning Research, 2014, 15(56): 1929-1958.
[36] Kingma D P, Ba J. Adam: A Method for Stochastic Optimization[OL]. arXiv Preprint, arXiv:1412.6980.
[37] Rakib M R H, Zeh N, Jankowska M, et al. Enhancement of Short Text Clustering by Iterative Classification[OL]. arXiv Preprint, arXiv:2001.11631.
[38] Zhang X, LeCun Y. Text Understanding from Scratch[OL]. arXiv Preprint, arXiv:1502.01710.
[39] Xu J M, Xu B, Wang P, et al. Self-Taught Convolutional Neural Networks for Short Text Clustering[J]. Neural Networks, 2017, 88: 22-31. pmid: 28157556
[40] Lang K. NewsWeeder: Learning to Filter Netnews[C]// Proceedings of the 12th International Conference on Machine Learning. 1995: 331-339.
[41] Cui Y M, Che W X, Liu T, et al. Pre-Training with Whole Word Masking for Chinese BERT[OL]. arXiv Preprint, arXiv:1906.08101.
[42] Kuhn H W. The Hungarian Method for the Assignment Problem[J]. Naval Research Logistics Quarterly, 1955, 2(1-2): 83-97.
[43] Shahnaz F, Berry M W, Pauca V P, et al. Document Clustering Using Nonnegative Matrix Factorization[J]. Information Processing & Management, 2006, 42(2): 373-386. doi: 10.1016/j.ipm.2004.11.005