Data Analysis and Knowledge Discovery  2024, Vol. 8 Issue (3): 98-109    DOI: 10.11925/infotech.2096-3467.2023.0156
SCCL Text Deep Clustering with Increased Cluster-Level Comparison
Li Jie1,2, Zhang Zhixiong1,2, Wang Yufei1,2
1National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
Abstract  

[Objective] This paper proposes ISCCL, a new deep clustering model for texts based on SCCL, aiming to improve its performance on clustering tasks. [Methods] First, the ISCCL model used a pre-trained sentence-vector model to perform data augmentation and encoding, obtaining two sets of augmented representations of the input texts. Then, we added two nonlinear network layers to the SCCL model, which reduced the augmented representations to a cluster feature space whose dimension equals the number of clusters. Finally, we constructed positive and negative cluster pairs from the column-space perspective for contrastive learning, guiding the model to explore features valuable for clustering and reducing the impact of false-positive samples. [Results] On five benchmark datasets (AgNews, Biomedical, StackOverflow, 20NewsGroups, and zh10), the clustering accuracy of the ISCCL model reached 88.89%, 48.74%, 78.17%, 56.97%, and 86.42%, respectively, an improvement of 0.69 to 2.67 percentage points over the SCCL model. [Limitations] The dimension of the cluster feature space must be pre-set (equal to the cluster number K). However, the true number of clusters in raw data is often difficult to determine, so this value must be adjusted per dataset. [Conclusions] The ISCCL model can effectively extract cluster features and improve deep clustering performance on texts.
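The cluster-level contrast described in [Methods] — treating the columns of the two views' soft-assignment matrices as cluster representations, with the same column across views forming a positive pair — can be sketched as below. This is an illustrative NumPy re-implementation of a column-wise NT-Xent loss in the spirit of the abstract's description, not the authors' released code; the function name and the temperature value `tau=0.5` are our assumptions.

```python
import numpy as np

def cluster_contrastive_loss(p_a, p_b, tau=0.5):
    """Cluster-level NT-Xent loss over two soft-assignment matrices
    p_a, p_b of shape (N, K): rows are samples, columns are clusters.
    The same column in the two augmented views is a positive pair;
    all other columns act as negatives."""
    k = p_a.shape[1]
    cols = np.concatenate([p_a.T, p_b.T], axis=0)          # (2K, N) column vectors
    cols = cols / np.linalg.norm(cols, axis=1, keepdims=True)
    sim = cols @ cols.T / tau                              # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)                         # exclude self-similarity
    pos = np.concatenate([np.arange(k, 2 * k), np.arange(k)])  # index of each positive
    logits = sim - sim.max(axis=1, keepdims=True)          # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * k), pos].mean()
```

Because the loss is taken over 2K column vectors rather than 2N row vectors, minimizing it pulls matching cluster assignments together across views, which is the cluster-level counterpart of SCCL's instance-level contrast.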

Key words: Contrastive Learning; Deep Clustering; SCCL; Cluster Feature Learning; Representation Learning
Received: 03 March 2023      Published: 28 April 2023
CLC number: TP391
Fund:National Science and Technology Library Special Project(2023XM42)
Corresponding Author: Zhang Zhixiong, ORCID: 0000-0003-1596-7487, E-mail: zhangzhx@mail.las.ac.cn.

Cite this article:

Li Jie, Zhang Zhixiong, Wang Yufei. SCCL Text Deep Clustering with Increased Cluster-Level Comparison. Data Analysis and Knowledge Discovery, 2024, 8(3): 98-109.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2023.0156     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2024/V8/I3/98

The Framework of ISCCL Model
The Dropout in Transformer
Cluster-Wise Feature Extraction Network and Positive & Negative Cluster Pairs Construction
The Illustration of Cluster-Wise Contrastive Learning
Dataset         Classes   Corpus Size
AgNews          4         8,000
Biomedical      20        20,000
StackOverflow   20        20,000
20NewsGroups    20        18,846
zh10            10        10,904
Basic Information of Experimental Datasets
SQL Query of zh10 Dataset Construction
Label                    Abstracts  Example
Type 2 diabetes          1,546      Effects of hydrogen sulfide on hepatic insulin resistance in type 2 diabetic rats. … may improve insulin resistance by up-regulating the expression levels of … in the hepatocytes of type 2 diabetic rats.
Lung cancer              1,184      Correlation between illness uncertainty and social support in lung cancer patients undergoing postoperative chemotherapy. … raising these patients' level of social support helps reduce the complexity component of their illness uncertainty.
Osteoporosis             1,124      Influence of osteoporosis-related factors on the prognosis of internal fixation surgery in patients with lumbar degenerative disease. … the higher the body weight and osteoporosis risk factors, the worse the short-term clinical outcome after interbody bone-graft fusion with internal fixation.
Coronary heart disease   1,260      Research progress on coronary heart disease comorbid with depression. … emotional instability; irritability affects attention …
Parkinson's disease      1,220      Regulation of vitamin … may be a potential target of kidney-tonifying traditional Chinese medicine for Parkinson's disease. … kidney-tonifying formulas may act by regulating vitamin …, which may be a potential therapeutic target.
Hepatitis                1,276      Detection and clinical significance of hepatitis C virus genotypes and host genotypes. … has seen preliminary application in monitoring adverse drug reactions.
Depression               778        Analysis of the relationship between serotonin levels and traditional Chinese medicine syndrome types and symptoms in patients with depression. Explores serotonin levels in patients with depression and the corresponding syndrome types … have a certain influence on symptoms.
Rheumatoid arthritis     830        Value of ultrasound in evaluating the clinical efficacy of anti-rheumatic drugs for rheumatoid arthritis. … can be used to evaluate clinical efficacy.
Asthma                   834        Great importance should be attached to research on the disease burden of bronchial asthma. … supports the formulation of national health policies and provides a reference for the rational use of health resources.
Epilepsy                 852        Clinical analysis of new-onset epilepsy in the elderly. … partial seizures were the most common, especially complex partial seizures; monotherapy was effective for most patients.
Total                    10,904
Data Set Statistics Information
Environment   Item           Configuration
Software      OS             Ubuntu 18.04.5 LTS
              IDE            PyCharm 2021.3.3 (Professional Edition)
              Language       Python 3.8
              GPU toolkit    CUDA 11.0
              Database       Microsoft SQL Server 2012
Hardware      CPU            40 × Intel(R) Xeon(R) Silver 4210 @ 2.20GHz
              Memory / Disk  188 GB / 4 TB
              GPU            3 × NVIDIA GeForce RTX TITAN
Environment Configuration
Model            AgNews         Biomedical     StackOverflow  20NewsGroups   zh10
                 ACC    NMI     ACC    NMI     ACC    NMI     ACC    NMI     ACC    NMI
BoW*[18]         27.60  2.60    14.30  9.20    18.50  14.00   -      -       -      -
STCC*[18]        -      -       43.60  38.10   51.10  49.00   -      -       -      -
Self-Train*[18]  -      -       54.80  47.10   59.80  54.80   -      -       -      -
HAC-SD*[18]      81.80  54.60   40.10  33.50   64.80  59.50   -      -       -      -
K-Means[2]       31.11  4.92    19.53  16.40   22.55  24.00   19.49  20.33   65.22  61.52
HC[4]            36.45  10.12   15.49  13.56   10.89  7.84    24.50  25.01   46.08  37.92
DEC[8]           62.19  36.71   10.29  3.95    9.06   2.12    41.20  49.79   41.48  34.26
IDEC[10]         68.36  35.48   13.66  10.40   9.92   3.77    33.91  30.38   14.20  0.06
AE+K-Means[22]   58.14  28.02   11.44  9.56    9.05   4.06    33.45  30.03   14.61  1.21
IMSAT[26]        54.57  36.40   15.66  8.28    8.54   1.99    42.47  43.29   28.05  17.08
SCCL*[18]        88.20  68.20   46.20  41.50   75.50  74.50   -      -       -      -
ISCCL            88.89  69.33   48.74  41.25   78.17  75.71   56.97  58.95   86.42  84.52
Comparison of Experimental Results
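The ACC figures above are conventionally computed by finding the best one-to-one mapping between predicted cluster ids and gold labels with the Hungarian method [42]. A minimal sketch, using SciPy's `linear_sum_assignment` as the Hungarian solver (the function name is ours, for illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC: fraction of samples correctly labeled under the best
    one-to-one mapping from cluster ids to gold labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    count = np.zeros((k, k), dtype=int)       # count[i, j]: samples in cluster i with label j
    for p, t in zip(y_pred, y_true):
        count[p, t] += 1
    # The Hungarian algorithm minimizes cost, so negate counts to maximize matches.
    rows, cols = linear_sum_assignment(-count)
    return count[rows, cols].sum() / y_true.size
```

For example, predictions [1, 1, 0, 0, 2, 2] against labels [0, 0, 1, 1, 2, 2] score 1.0, since the mapping 1→0, 0→1, 2→2 matches every sample.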
[1] Berkhin P. A Survey of Clustering Data Mining Techniques[M]// Grouping Multidimensional Data. Cham: Springer, 2006:25-72.
[2] MacQueen J B. Some Methods for Classification and Analysis of Multivariate Observations[C]// Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Volume 1:Statistics. 1967: 281-297.
[3] Lloyd S P. Least Squares Quantization in PCM[J]. IEEE Transactions on Information Theory, 1982, 28(2): 129-137. doi: 10.1109/TIT.1982.1056489.
[4] von Luxburg U. A Tutorial on Spectral Clustering[J]. Statistics and Computing, 2007, 17(4): 395-416. doi: 10.1007/s11222-007-9033-z.
[5] Ester M, Kriegel H P, Sander J, et al. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise[C]// Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining(KDD-96). 1996: 226-231.
[6] Bellman R E. Adaptive Control Processes: A Guided Tour[M]. Princeton, New Jersey: Princeton University Press, 1961.
[7] Shlens J. A Tutorial on Principal Component Analysis [OL]. arXiv Preprint, arXiv:1404.1100.
[8] Xie J Y, Girshick R, Farhadi A. Unsupervised Deep Embedding for Clustering Analysis[C]// Proceedings of the 33rd International Conference on Machine Learning-Volume 48. 2016: 478-487.
[9] Yang J W, Parikh D, Batra D. Joint Unsupervised Learning of Deep Representations and Image Clusters[C]// Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016: 5147-5156.
[10] Guo X F, Gao L, Liu X W, et al. Improved Deep Embedded Clustering with Local Structure Preservation[C]// Proceedings of the 26th International Joint Conference on Artificial Intelligence. 2017: 1753-1759.
[11] Wu L R, Liu Z C, Zang Z L, et al. Generalized Clustering and Multi-Manifold Learning with Geometric Structure Preservation [OL]. arXiv Preprint, arXiv:2009.09590v4.
[12] Tian F, Gao B, Cui Q, et al. Learning Deep Representations for Graph Clustering[C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2014. DOI: 10.1609/aaai.v28i1.8916.
[13] Jiang Z X, Zheng Y, Tan H C, et al. Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering[C]// Proceedings of the 26th International Joint Conference on Artificial Intelligence. 2017: 1965-1972.
[14] Wu Z R, Xiong Y J, Yu S X, et al. Unsupervised Feature Learning via Non-parametric Instance Discrimination[C]// Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018: 3733-3742.
[15] Chen T, Kornblith S, Norouzi M, et al. A Simple Framework for Contrastive Learning of Visual Representations [OL]. arXiv Preprint, arXiv:2002.05709.
[16] Huang J B, Gong S G, Zhu X T. Deep Semantic Clustering by Partition Confidence Maximisation[C]// Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 8846-8855.
[17] Li J N, Zhou P, Xiong C M, et al. Prototypical Contrastive Learning of Unsupervised Representations [OL]. arXiv Preprint, arXiv: 2005.04966.
[18] Zhang D J, Nan F, Wei X K, et al. Supporting Clustering with Contrastive Learning [OL]. arXiv Preprint, arXiv:2103.12953.
[19] Gomes R, Krause A, Perona P. Discriminative Clustering by Regularized Information Maximization[C]// Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 1. 2010: 775-783.
[20] Li Y F, Yang M X, Peng D Z, et al. Twin Contrastive Learning for Online Clustering[J]. International Journal of Computer Vision, 2022, 130(9): 2205-2221. doi: 10.1007/s11263-022-01639-z.
[21] Li Y F, Hu P, Liu Z T, et al. Contrastive Clustering [OL]. arXiv Preprint, arXiv:2009.09687.
[22] Song C F, Liu F, Huang Y Z, et al. Auto-Encoder Based Data Clustering[C]//Proceedings of CIARP 2013:Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. 2013: 117-124.
[23] Yang B, Fu X, Sidiropoulos N D, et al. Towards K-Means-Friendly Spaces: Simultaneous Deep Learning and Clustering[C]// Proceedings of the 34th International Conference on Machine Learning - Volume 70. 2017: 3861-3870.
[24] Hadifar A, Sterckx L, Demeester T, et al. A Self-Training Approach for Short Text Clustering[C]// Proceedings of the 4th Workshop on Representation Learning for NLP. 2019: 194-199.
[25] Bo D Y, Wang X, Shi C, et al. Structural Deep Clustering Network[C]// Proceedings of the Web Conference 2020. 2020: 1400-1410.
[26] Hu W H, Miyato T, Tokui S, et al. Learning Discrete Representations via Information Maximizing Self-Augmented Training [OL]. arXiv Preprint, arXiv:1702.08720.
[27] van den Oord A, Li Y Z, Vinyals O. Representation Learning with Contrastive Predictive Coding [OL]. arXiv Preprint, arXiv:1807.03748.
[28] Sohn K. Improved Deep Metric Learning with Multi-class n-Pair Loss Objective[C]// Proceedings of the 30th International Conference on Neural Information Processing Systems. 2016: 1857-1865.
[29] Yan Y M, Li R M, Wang S R, et al. ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer[C]// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1:Long Papers). 2021: 5065-5075.
[30] Zhang Junlin. How to Use Noisy Data: The Application of Contrastive Learning in Weibo Scenarios[EB/OL]. [2023-03-06]. https://mp.weixin.qq.com/s/9N2tk6QTCTuBkrU5Xwb2ow. (in Chinese)
[31] Gao T Y, Yao X C, Chen D Q. SimCSE: Simple Contrastive Learning of Sentence Embeddings [OL]. arXiv Preprint, arXiv:2104.08821.
[32] Kobayashi S. Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations [OL]. arXiv Preprint, arXiv:1805.06201.
[33] Ma E. NLP Augmentation[EB/OL].[2023-03-12]. https://github.com/makcedward/nlpaug.
[34] Vaswani A, Shazeer N, Parmar N, et al. Attention is All You Need [OL]. arXiv Preprint, arXiv:1706.03762.
[35] Srivastava N, Hinton G, Krizhevsky A, et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting[J]. Journal of Machine Learning Research, 2014, 15(56): 1929-1958.
[36] Kingma D P, Ba J. Adam: A Method for Stochastic Optimization [OL]. arXiv Preprint, arXiv:1412.6980.
[37] Rakib M R H, Zeh N, Jankowska M, et al. Enhancement of Short Text Clustering by Iterative Classification [OL]. arXiv Preprint, arXiv: 2001.11631.
[38] Zhang X, LeCun Y. Text Understanding from Scratch [OL]. arXiv Preprint, arXiv:1502.01710.
[39] Xu J M, Xu B, Wang P, et al. Self-Taught Convolutional Neural Networks for Short Text Clustering[J]. Neural Networks, 2017, 88: 22-31. pmid: 28157556.
[40] Lang K. NewsWeeder: Learning to Filter Netnews[C]// Proceedings of the 12th International Conference on Machine Learning. 1995: 331-339.
[41] Cui Y M, Che W X, Liu T, et al. Pre-training with Whole Word Masking for Chinese BERT [OL]. arXiv Preprint, arXiv:1906.08101.
[42] Kuhn H W. The Hungarian Method for the Assignment Problem[J]. Naval Research Logistics Quarterly, 1955, 2(1-2): 83-97. doi: 10.1002/nav.v2:1/2.
[43] Shahnaz F, Berry M W, Pauca V P, et al. Document Clustering Using Nonnegative Matrix Factorization[J]. Information Processing & Management, 2006, 42(2): 373-386. doi: 10.1016/j.ipm.2004.11.005.