Data Analysis and Knowledge Discovery  2024, Vol. 8 Issue (3): 98-109    DOI: 10.11925/infotech.2096-3467.2023.0156
SCCL Text Deep Clustering with Increased Cluster-Level Comparison
Li Jie1,2, Zhang Zhixiong1,2, Wang Yufei1,2
1National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
Abstract  

[Objective] This paper proposes ISCCL, a new deep clustering model for texts based on SCCL, aiming to improve its performance on clustering tasks. [Methods] First, the ISCCL model used a pre-trained sentence-vector model to perform data augmentation and encoding, obtaining two sets of augmented representations of the input texts. Then, we added two nonlinear network layers to the SCCL model, which reduced the augmented representations to a cluster feature space whose dimension equals the number of clusters. Finally, we constructed positive and negative cluster pairs from the column-space perspective for contrastive learning, guiding the model to explore features valuable for clustering and reducing the impact of false-positive samples. [Results] On five benchmark datasets (AgNews, Biomedical, StackOverflow, 20NewsGroups, and zh10), the clustering accuracy of the ISCCL model reached 88.89%, 48.74%, 78.17%, 56.97%, and 86.42%, respectively, an improvement of 0.69 to 2.67 percentage points over the SCCL model. [Limitations] The dimension of the cluster feature space must be pre-set (equal to the cluster number K). However, the true number of clusters in raw data is often difficult to determine, so this value must be adjusted per dataset. [Conclusions] The ISCCL model can effectively extract cluster features and improve deep clustering performance on texts.
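The cluster-level contrast described in [Methods] — treating the columns of the two views' soft-assignment matrices as cluster representations, with the same column across views forming a positive pair — can be sketched as below. This is an illustrative NumPy re-implementation of a column-wise NT-Xent loss in the spirit of the abstract's description, not the authors' released code; the function name and the temperature value `tau=0.5` are our assumptions.

```python
import numpy as np

def cluster_contrastive_loss(p_a, p_b, tau=0.5):
    """Cluster-level NT-Xent loss over two soft-assignment matrices
    p_a, p_b of shape (N, K): rows are samples, columns are clusters.
    The same column in the two augmented views is a positive pair;
    all other columns act as negatives."""
    k = p_a.shape[1]
    cols = np.concatenate([p_a.T, p_b.T], axis=0)          # (2K, N) column vectors
    cols = cols / np.linalg.norm(cols, axis=1, keepdims=True)
    sim = cols @ cols.T / tau                              # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)                         # exclude self-similarity
    pos = np.concatenate([np.arange(k, 2 * k), np.arange(k)])  # index of each positive
    logits = sim - sim.max(axis=1, keepdims=True)          # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * k), pos].mean()
```

Because the loss is taken over 2K column vectors rather than 2N row vectors, minimizing it pulls matching cluster assignments together across views, which is the cluster-level counterpart of SCCL's instance-level contrast.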

Key words: Contrastive Learning; Deep Clustering; SCCL; Cluster Feature Learning; Representation Learning
Received: 03 March 2023      Published: 28 April 2023
CLC number: TP391
Fund:National Science and Technology Library Special Project(2023XM42)
Corresponding Author: Zhang Zhixiong, ORCID: 0000-0003-1596-7487, E-mail: zhangzhx@mail.las.ac.cn.

Cite this article:

Li Jie, Zhang Zhixiong, Wang Yufei. SCCL Text Deep Clustering with Increased Cluster-Level Comparison. Data Analysis and Knowledge Discovery, 2024, 8(3): 98-109.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2023.0156     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2024/V8/I3/98

The Framework of ISCCL Model
The Dropout in Transformer
Cluster-Wise Feature Extraction Network and Positive & Negative Cluster Pairs Construction
The Illustration of Cluster-Wise Contrastive Learning
Dataset         Classes   Corpus Size
AgNews          4         8,000
Biomedical      20        20,000
StackOverflow   20        20,000
20NewsGroups    20        18,846
zh10            10        10,904
Basic Information of Experimental Datasets
SQL Query of zh10 Dataset Construction
Label                    Abstracts  Example
Type 2 diabetes          1,546      Effects of hydrogen sulfide on hepatic insulin resistance in type 2 diabetic rats. … may improve insulin resistance by up-regulating the expression levels of … in the hepatocytes of type 2 diabetic rats.
Lung cancer              1,184      Correlation between illness uncertainty and social support in lung cancer patients undergoing postoperative chemotherapy. … raising these patients' level of social support helps reduce the complexity component of their illness uncertainty.
Osteoporosis             1,124      Influence of osteoporosis-related factors on the prognosis of internal fixation surgery in patients with lumbar degenerative disease. … the higher the body weight and osteoporosis risk factors, the worse the short-term clinical outcome after interbody bone-graft fusion with internal fixation.
Coronary heart disease   1,260      Research progress on coronary heart disease comorbid with depression. … emotional instability; irritability affects attention …
Parkinson's disease      1,220      Regulation of vitamin … may be a potential target of kidney-tonifying traditional Chinese medicine for Parkinson's disease. … kidney-tonifying formulas may act by regulating vitamin …, which may be a potential therapeutic target.
Hepatitis                1,276      Detection and clinical significance of hepatitis C virus genotypes and host genotypes. … has seen preliminary application in monitoring adverse drug reactions.
Depression               778        Analysis of the relationship between serotonin levels and traditional Chinese medicine syndrome types and symptoms in patients with depression. Explores serotonin levels in patients with depression and the corresponding syndrome types … have a certain influence on symptoms.
Rheumatoid arthritis     830        Value of ultrasound in evaluating the clinical efficacy of anti-rheumatic drugs for rheumatoid arthritis. … can be used to evaluate clinical efficacy.
Asthma                   834        Great importance should be attached to research on the disease burden of bronchial asthma. … supports the formulation of national health policies and provides a reference for the rational use of health resources.
Epilepsy                 852        Clinical analysis of new-onset epilepsy in the elderly. … partial seizures were the most common, especially complex partial seizures; monotherapy was effective for most patients.
Total                    10,904
Data Set Statistics Information
Environment   Item           Configuration
Software      OS             Ubuntu 18.04.5 LTS
              IDE            PyCharm 2021.3.3 (Professional Edition)
              Language       Python 3.8
              GPU toolkit    CUDA 11.0
              Database       Microsoft SQL Server 2012
Hardware      CPU            40 × Intel(R) Xeon(R) Silver 4210 @ 2.20GHz
              Memory / Disk  188 GB / 4 TB
              GPU            3 × NVIDIA GeForce RTX TITAN
Environment Configuration
Model            AgNews         Biomedical     StackOverflow  20NewsGroups   zh10
                 ACC    NMI     ACC    NMI     ACC    NMI     ACC    NMI     ACC    NMI
BoW*[18]         27.60  2.60    14.30  9.20    18.50  14.00   -      -       -      -
STCC*[18]        -      -       43.60  38.10   51.10  49.00   -      -       -      -
Self-Train*[18]  -      -       54.80  47.10   59.80  54.80   -      -       -      -
HAC-SD*[18]      81.80  54.60   40.10  33.50   64.80  59.50   -      -       -      -
K-Means[2]       31.11  4.92    19.53  16.40   22.55  24.00   19.49  20.33   65.22  61.52
HC[4]            36.45  10.12   15.49  13.56   10.89  7.84    24.50  25.01   46.08  37.92
DEC[8]           62.19  36.71   10.29  3.95    9.06   2.12    41.20  49.79   41.48  34.26
IDEC[10]         68.36  35.48   13.66  10.40   9.92   3.77    33.91  30.38   14.20  0.06
AE+K-Means[22]   58.14  28.02   11.44  9.56    9.05   4.06    33.45  30.03   14.61  1.21
IMSAT[26]        54.57  36.40   15.66  8.28    8.54   1.99    42.47  43.29   28.05  17.08
SCCL*[18]        88.20  68.20   46.20  41.50   75.50  74.50   -      -       -      -
ISCCL            88.89  69.33   48.74  41.25   78.17  75.71   56.97  58.95   86.42  84.52
Comparison of Experimental Results
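The ACC figures above are conventionally computed by finding the best one-to-one mapping between predicted cluster ids and gold labels with the Hungarian method [42]. A minimal sketch, using SciPy's `linear_sum_assignment` as the Hungarian solver (the function name is ours, for illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC: fraction of samples correctly labeled under the best
    one-to-one mapping from cluster ids to gold labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    count = np.zeros((k, k), dtype=int)       # count[i, j]: samples in cluster i with label j
    for p, t in zip(y_pred, y_true):
        count[p, t] += 1
    # The Hungarian algorithm minimizes cost, so negate counts to maximize matches.
    rows, cols = linear_sum_assignment(-count)
    return count[rows, cols].sum() / y_true.size
```

For example, predictions [1, 1, 0, 0, 2, 2] against labels [0, 0, 1, 1, 2, 2] score 1.0, since the mapping 1→0, 0→1, 2→2 matches every sample.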
[1] Berkhin P. A Survey of Clustering Data Mining Techniques[M]// Grouping Multidimensional Data. Cham: Springer, 2006:25-72.
[2] MacQueen J B. Some Methods for Classification and Analysis of Multivariate Observations[C]// Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Volume 1:Statistics. 1967: 281-297.
[3] Lloyd S P. Least Squares Quantization in PCM[J]. IEEE Transactions on Information Theory, 1982, 28(2): 129-137. doi: 10.1109/TIT.1982.1056489.
[4] von Luxburg U. A Tutorial on Spectral Clustering[J]. Statistics and Computing, 2007, 17(4): 395-416. doi: 10.1007/s11222-007-9033-z.
[5] Ester M, Kriegel H P, Sander J, et al. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise[C]// Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining(KDD-96). 1996: 226-231.
[6] Bellman R E. Adaptive Control Processes: A Guided Tour[M]. Princeton, New Jersey: Princeton University Press, 1961.
[7] Shlens J. A Tutorial on Principal Component Analysis [OL]. arXiv Preprint, arXiv:1404.1100.
[8] Xie J Y, Girshick R, Farhadi A. Unsupervised Deep Embedding for Clustering Analysis[C]// Proceedings of the 33rd International Conference on Machine Learning-Volume 48. 2016: 478-487.
[9] Yang J W, Parikh D, Batra D. Joint Unsupervised Learning of Deep Representations and Image Clusters[C]// Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016: 5147-5156.
[10] Guo X F, Gao L, Liu X W, et al. Improved Deep Embedded Clustering with Local Structure Preservation[C]// Proceedings of the 26th International Joint Conference on Artificial Intelligence. 2017: 1753-1759.
[11] Wu L R, Liu Z C, Zang Z L, et al. Generalized Clustering and Multi-Manifold Learning with Geometric Structure Preservation [OL]. arXiv Preprint, arXiv:2009.09590v4.
[12] Tian F, Gao B, Cui Q, et al. Learning Deep Representations for Graph Clustering[C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2014. DOI: 10.1609/aaai.v28i1.8916.
[13] Jiang Z X, Zheng Y, Tan H C, et al. Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering[C]// Proceedings of the 26th International Joint Conference on Artificial Intelligence. 2017: 1965-1972.
[14] Wu Z R, Xiong Y J, Yu S X, et al. Unsupervised Feature Learning via Non-parametric Instance Discrimination[C]// Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018: 3733-3742.
[15] Chen T, Kornblith S, Norouzi M, et al. A Simple Framework for Contrastive Learning of Visual Representations [OL]. arXiv Preprint, arXiv:2002.05709.
[16] Huang J B, Gong S G, Zhu X T. Deep Semantic Clustering by Partition Confidence Maximisation[C]// Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 8846-8855.
[17] Li J N, Zhou P, Xiong C M, et al. Prototypical Contrastive Learning of Unsupervised Representations [OL]. arXiv Preprint, arXiv: 2005.04966.
[18] Zhang D J, Nan F, Wei X K, et al. Supporting Clustering with Contrastive Learning [OL]. arXiv Preprint, arXiv:2103.12953.
[19] Gomes R, Krause A, Perona P. Discriminative Clustering by Regularized Information Maximization[C]// Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 1. 2010: 775-783.
[20] Li Y F, Yang M X, Peng D Z, et al. Twin Contrastive Learning for Online Clustering[J]. International Journal of Computer Vision, 2022, 130(9): 2205-2221. doi: 10.1007/s11263-022-01639-z.
[21] Li Y F, Hu P, Liu Z T, et al. Contrastive Clustering [OL]. arXiv Preprint, arXiv:2009.09687.
[22] Song C F, Liu F, Huang Y Z, et al. Auto-Encoder Based Data Clustering[C]//Proceedings of CIARP 2013:Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. 2013: 117-124.
[23] Yang B, Fu X, Sidiropoulos N D, et al. Towards K-Means-Friendly Spaces: Simultaneous Deep Learning and Clustering[C]// Proceedings of the 34th International Conference on Machine Learning - Volume 70. 2017: 3861-3870.
[24] Hadifar A, Sterckx L, Demeester T, et al. A Self-Training Approach for Short Text Clustering[C]// Proceedings of the 4th Workshop on Representation Learning for NLP. 2019: 194-199.
[25] Bo D Y, Wang X, Shi C, et al. Structural Deep Clustering Network[C]// Proceedings of the Web Conference 2020. 2020: 1400-1410.
[26] Hu W H, Miyato T, Tokui S, et al. Learning Discrete Representations via Information Maximizing Self-Augmented Training [OL]. arXiv Preprint, arXiv:1702.08720.
[27] van den Oord A, Li Y Z, Vinyals O. Representation Learning with Contrastive Predictive Coding [OL]. arXiv Preprint, arXiv:1807.03748.
[28] Sohn K. Improved Deep Metric Learning with Multi-class n-Pair Loss Objective[C]// Proceedings of the 30th International Conference on Neural Information Processing Systems. 2016: 1857-1865.
[29] Yan Y M, Li R M, Wang S R, et al. ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer[C]// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1:Long Papers). 2021: 5065-5075.
[30] Zhang Junlin. How to Use Noisy Data: The Application of Contrastive Learning in Weibo Scenarios[EB/OL]. [2023-03-06]. https://mp.weixin.qq.com/s/9N2tk6QTCTuBkrU5Xwb2ow. (in Chinese)
[31] Gao T Y, Yao X C, Chen D Q. SimCSE: Simple Contrastive Learning of Sentence Embeddings [OL]. arXiv Preprint, arXiv:2104.08821.
[32] Kobayashi S. Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations [OL]. arXiv Preprint, arXiv:1805.06201.
[33] Ma E. NLP Augmentation[EB/OL].[2023-03-12]. https://github.com/makcedward/nlpaug.
[34] Vaswani A, Shazeer N, Parmar N, et al. Attention is All You Need [OL]. arXiv Preprint, arXiv:1706.03762.
[35] Srivastava N, Hinton G, Krizhevsky A, et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting[J]. Journal of Machine Learning Research, 2014, 15(56): 1929-1958.
[36] Kingma D P, Ba J. Adam: A Method for Stochastic Optimization [OL]. arXiv Preprint, arXiv:1412.6980.
[37] Rakib M R H, Zeh N, Jankowska M, et al. Enhancement of Short Text Clustering by Iterative Classification [OL]. arXiv Preprint, arXiv: 2001.11631.
[38] Zhang X, LeCun Y. Text Understanding from Scratch [OL]. arXiv Preprint, arXiv:1502.01710.
[39] Xu J M, Xu B, Wang P, et al. Self-Taught Convolutional Neural Networks for Short Text Clustering[J]. Neural Networks, 2017, 88: 22-31. pmid: 28157556.
[40] Lang K. NewsWeeder: Learning to Filter Netnews[C]// Proceedings of the 12th International Conference on Machine Learning. 1995: 331-339.
[41] Cui Y M, Che W X, Liu T, et al. Pre-training with Whole Word Masking for Chinese BERT [OL]. arXiv Preprint, arXiv:1906.08101.
[42] Kuhn H W. The Hungarian Method for the Assignment Problem[J]. Naval Research Logistics Quarterly, 1955, 2(1-2): 83-97. doi: 10.1002/nav.v2:1/2.
[43] Shahnaz F, Berry M W, Pauca V P, et al. Document Clustering Using Nonnegative Matrix Factorization[J]. Information Processing & Management, 2006, 42(2): 373-386. doi: 10.1016/j.ipm.2004.11.005.