Data Analysis and Knowledge Discovery, 2024, Vol. 8, Issue 3: 98-109     https://doi.org/10.11925/infotech.2096-3467.2023.0156
Research Paper
SCCL Text Deep Clustering with Increased Cluster-Level Comparison
Li Jie1,2,Zhang Zhixiong1,2(),Wang Yufei1,2
1National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China

Abstract

[Objective] This paper proposes ISCCL, a new deep clustering model for text based on SCCL, aiming to improve SCCL's performance on text clustering tasks. [Methods] First, the ISCCL model uses a sentence-vector pre-trained language model to perform data augmentation and encoding, obtaining two sets of augmented representations of the input texts. Then, two layers of nonlinear networks are added to SCCL to project the augmented representations into a cluster feature space whose dimension equals the number of clusters. Third, positive and negative cluster pairs are constructed from the perspective of column space for contrastive learning, guiding the model to mine features useful for the clustering task and reducing the impact of false-positive samples. [Results] On five benchmark datasets (AgNews, Biomedical, StackOverflow, 20NewsGroups, and zh10), the clustering accuracy of ISCCL reaches 88.89%, 48.74%, 78.17%, 56.97%, and 86.42%, respectively, an improvement of 0.69% to 2.67% over the SCCL model. [Limitations] The dimension of the cluster feature space must be set in advance (equal to the number of clusters K). In practice, the true number of clusters in the data is often unknown, so this setting must be adjusted according to the dataset. [Conclusions] The ISCCL model effectively extracts cluster features and improves deep clustering performance on text over SCCL.
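The column-space, cluster-level contrast described above can be sketched as an InfoNCE-style loss over matched columns of two augmented soft-assignment matrices. The following is a minimal NumPy illustration of that general idea, not the authors' implementation; the function names, the temperature value, and the cosine-similarity choice are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cluster_contrastive_loss(y1, y2, tau=0.5):
    """Column-space cluster-level InfoNCE loss (illustrative sketch).

    y1, y2: (batch, K) soft cluster assignments from two augmented views.
    Column k of each matrix is treated as the representation of cluster k;
    (y1[:, k], y2[:, k]) forms a positive pair, all other columns serve
    as negatives.
    """
    # Normalize columns so dot products become cosine similarities.
    c1 = y1.T / (np.linalg.norm(y1.T, axis=1, keepdims=True) + 1e-12)  # (K, batch)
    c2 = y2.T / (np.linalg.norm(y2.T, axis=1, keepdims=True) + 1e-12)
    c = np.vstack([c1, c2])                      # (2K, batch)
    sim = c @ c.T / tau                          # pairwise similarities
    K = c1.shape[0]
    np.fill_diagonal(sim, -np.inf)               # exclude self-similarity
    probs = softmax(sim, axis=1)
    # Index of each column's positive partner in the stacked matrix.
    pos = np.concatenate([np.arange(K, 2 * K), np.arange(K)])
    return -np.mean(np.log(probs[np.arange(2 * K), pos] + 1e-12))
```

As a sanity check, the loss is smallest when the two views agree column by column and grows when the cluster columns are mismatched, which is the behavior that lets the loss pull matching cluster features together.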

Keywords: Contrastive Learning; Deep Clustering; SCCL; Cluster Feature Learning; Representation Learning
Received: 2023-03-03      Online: 2023-04-28
CLC number: TP391
Fund: *Special Project of the National Science and Technology Library (2023XM42)
Corresponding author: Zhang Zhixiong, ORCID: 0000-0003-1596-7487, E-mail: zhangzhx@mail.las.ac.cn
Cite this article:
Li Jie, Zhang Zhixiong, Wang Yufei. SCCL Text Deep Clustering with Increased Cluster-Level Comparison. Data Analysis and Knowledge Discovery, 2024, 8(3): 98-109.
Link this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2023.0156      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2024/V8/I3/98
Fig.1  ISCCL model framework
Fig.2  Dropout in the Transformer encoder
Fig.3  Cluster feature extraction network and construction of positive/negative cluster pairs
Fig.4  Cluster-level contrastive learning
Dataset  Classes  Documents
AgNews 4 8 000
Biomedical 20 20 000
StackOverflow 20 20 000
20NewsGroups 20 18 846
zh10 10 10 904
Table 1  Overview of the experimental datasets
Fig.5  SQL query used to construct the zh10 dataset
Label  Abstracts  Example
Type 2 diabetes  1 546  Effect of hydrogen sulfide on hepatic insulin resistance in type 2 diabetic rats. … improves insulin resistance by upregulating expression levels in the liver cells of type 2 diabetic rats.
Lung cancer  1 184  Correlation between illness uncertainty and social support in lung cancer patients receiving postoperative chemotherapy. … raising the social support of these patients helps reduce the complexity dimension of their illness uncertainty.
Osteoporosis  1 124  Influence of osteoporosis-related factors on the prognosis of internal fixation surgery in patients with degenerative lumbar disease. … the higher the body weight and osteoporosis risk factors, the worse the short-term clinical outcome after interbody fusion with internal fixation.
Coronary heart disease  1 260  Research progress on coronary heart disease comorbid with depression. … emotional instability, irritability affecting attention.
Parkinson's disease  1 220  Regulating vitamin levels may be a potential target of kidney-tonifying traditional Chinese medicine for Parkinson's disease. … kidney-tonifying formulas may act by regulating vitamin levels, a potential therapeutic target.
Hepatitis  1 276  Detection and clinical significance of hepatitis C virus genotypes and host genotypes. … has seen preliminary application in monitoring adverse drug reactions.
Depression  778  Analysis of the relationship between serotonin, TCM syndrome types, and symptoms in patients with depression. Exploring serotonin levels in depression patients and their corresponding TCM syndrome types … has a certain influence on symptoms.
Rheumatoid arthritis  830  Value of ultrasound in evaluating the clinical efficacy of antirheumatic drugs for rheumatoid arthritis. … can be used to evaluate clinical efficacy.
Asthma  834  High priority should be given to research on the disease burden of bronchial asthma. … supports the formulation of national health policies and provides a reference for the rational use of health resources.
Epilepsy  852  Clinical analysis of new-onset epilepsy in the elderly. … partial seizures were most common, especially complex partial seizures. Monotherapy was effective in most patients.
Total  10 904
Table 2  Examples from the zh10 dataset
Environment  Configuration
Software  Operating system  Ubuntu 18.04.5 LTS
  IDE  PyCharm 2021.3.3 (Professional Edition)
  Language  Python 3.8
  GPU acceleration  CUDA 11.0
  Database  Microsoft SQL Server 2012
Hardware  CPU  40× Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
  RAM / Disk  188 GB / 4 TB
  GPU  3× NVIDIA GeForce RTX TITAN
Table 3  Software and hardware environment
Model          AgNews       Biomedical   StackOverflow  20NewsGroups  zh10
               ACC    NMI   ACC    NMI   ACC    NMI     ACC    NMI    ACC    NMI
Bow*[18] 27.60 2.60 14.30 9.20 18.50 14.00 - - - -
STCC*[18] - - 43.60 38.10 51.10 49.00 - - - -
Self-Train*[18] - - 54.80 47.10 59.80 54.80 - - - -
HAC-SD*[18] 81.80 54.60 40.10 33.50 64.80 59.50 - - - -
K-Means[2] 31.11 4.92 19.53 16.40 22.55 24.00 19.49 20.33 65.22 61.52
HC[4] 36.45 10.12 15.49 13.56 10.89 7.84 24.50 25.01 46.08 37.92
DEC[8] 62.19 36.71 10.29 3.95 9.06 2.12 41.20 49.79 41.48 34.26
IDEC[10] 68.36 35.48 13.66 10.40 9.92 3.77 33.91 30.38 14.20 0.06
AE+K-Means[22] 58.14 28.02 11.44 9.56 9.05 4.06 33.45 30.03 14.61 1.21
IMSAT[26] 54.57 36.40 15.66 8.28 8.54 1.99 42.47 43.29 28.05 17.08
SCCL*[18] 88.20 68.20 46.20 41.50 75.50 74.50 - - - -
ISCCL 88.89 69.33 48.74 41.25 78.17 75.71 56.97 58.95 86.42 84.52
Table 4  Comparison of ISCCL with other models (%)
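The clustering accuracy (ACC) reported above is conventionally computed by finding the best one-to-one mapping between predicted cluster labels and ground-truth classes with the Hungarian method (reference [42]). The sketch below uses SciPy's linear_sum_assignment to do this; it illustrates the standard evaluation protocol, not necessarily the authors' exact script, and the function name is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Best-match clustering accuracy via the Hungarian method.

    Builds a confusion matrix between predicted clusters and true classes,
    then finds the one-to-one label mapping that maximizes the number of
    correctly matched samples.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                      # count (predicted, true) co-occurrences
    rows, cols = linear_sum_assignment(cost, maximize=True)
    return cost[rows, cols].sum() / y_true.size
```

For example, `clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])` returns 1.0, since swapping labels 0 and 1 aligns the predictions perfectly with the ground truth; cluster labels are arbitrary, which is why this remapping step is needed before counting matches.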
[1] Berkhin P. A Survey of Clustering Data Mining Techniques[M]// Grouping Multidimensional Data. Cham: Springer, 2006:25-72.
[2] MacQueen J B. Some Methods for Classification and Analysis of Multivariate Observations[C]// Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Volume 1:Statistics. 1967: 281-297.
[3] Lloyd S P. Least Squares Quantization in PCM[J]. IEEE Transactions on Information Theory, 1982, 28(2): 129-137.
doi: 10.1109/TIT.1982.1056489
[4] von Luxburg U. A Tutorial on Spectral Clustering[J]. Statistics and Computing, 2007, 17(4): 395-416.
doi: 10.1007/s11222-007-9033-z
[5] Ester M, Kriegel H P, Sander J, et al. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise[C]// Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining(KDD-96). 1996: 226-231.
[6] Bellman R E. Adaptive Control Processes: A Guided Tour[M]. Princeton, New Jersey: Princeton University Press, 1961.
[7] Shlens J. A Tutorial on Principal Component Analysis [OL]. arXiv Preprint, arXiv:1404.1100.
[8] Xie J Y, Girshick R, Farhadi A. Unsupervised Deep Embedding for Clustering Analysis[C]// Proceedings of the 33rd International Conference on Machine Learning-Volume 48. 2016: 478-487.
[9] Yang J W, Parikh D, Batra D. Joint Unsupervised Learning of Deep Representations and Image Clusters[C]// Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016: 5147-5156.
[10] Guo X F, Gao L, Liu X W, et al. Improved Deep Embedded Clustering with Local Structure Preservation[C]// Proceedings of the 26th International Joint Conference on Artificial Intelligence. 2017: 1753-1759.
[11] Wu L R, Liu Z C, Zang Z L, et al. Generalized Clustering and Multi-Manifold Learning with Geometric Structure Preservation [OL]. arXiv Preprint, arXiv:2009.09590v4.
[12] Tian F, Gao B, Cui Q, et al. Learning Deep Representations for Graph Clustering[C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2014. DOI: 10.1609/aaai.v28i1.8916.
[13] Jiang Z X, Zheng Y, Tan H C, et al. Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering[C]// Proceedings of the 26th International Joint Conference on Artificial Intelligence. 2017: 1965-1972.
[14] Wu Z R, Xiong Y J, Yu S X, et al. Unsupervised Feature Learning via Non-parametric Instance Discrimination[C]// Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018: 3733-3742.
[15] Chen T, Kornblith S, Norouzi M, et al. A Simple Framework for Contrastive Learning of Visual Representations [OL]. arXiv Preprint, arXiv:2002.05709.
[16] Huang J B, Gong S G, Zhu X T. Deep Semantic Clustering by Partition Confidence Maximisation[C]// Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 8846-8855.
[17] Li J N, Zhou P, Xiong C M, et al. Prototypical Contrastive Learning of Unsupervised Representations [OL]. arXiv Preprint, arXiv: 2005.04966.
[18] Zhang D J, Nan F, Wei X K, et al. Supporting Clustering with Contrastive Learning [OL]. arXiv Preprint, arXiv:2103.12953.
[19] Gomes R, Krause A, Perona P. Discriminative Clustering by Regularized Information Maximization[C]// Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 1. 2010: 775-783.
[20] Li Y F, Yang M X, Peng D Z, et al. Twin Contrastive Learning for Online Clustering[J]. International Journal of Computer Vision, 2022, 130(9): 2205-2221.
doi: 10.1007/s11263-022-01639-z
[21] Li Y F, Hu P, Liu Z T, et al. Contrastive Clustering [OL]. arXiv Preprint, arXiv:2009.09687.
[22] Song C F, Liu F, Huang Y Z, et al. Auto-Encoder Based Data Clustering[C]//Proceedings of CIARP 2013:Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. 2013: 117-124.
[23] Yang B, Fu X, Sidiropoulos N D, et al. Towards K-Means-Friendly Spaces: Simultaneous Deep Learning and Clustering[C]// Proceedings of the 34th International Conference on Machine Learning - Volume 70. 2017: 3861-3870.
[24] Hadifar A, Sterckx L, Demeester T, et al. A Self-Training Approach for Short Text Clustering[C]// Proceedings of the 4th Workshop on Representation Learning for NLP. 2019: 194-199.
[25] Bo D Y, Wang X, Shi C, et al. Structural Deep Clustering Network[C]// Proceedings of the Web Conference 2020. 2020: 1400-1410.
[26] Hu W H, Miyato T, Tokui S, et al. Learning Discrete Representations via Information Maximizing Self-Augmented Training [OL]. arXiv Preprint, arXiv:1702.08720.
[27] van den Oord A, Li Y Z, Vinyals O. Representation Learning with Contrastive Predictive Coding [OL]. arXiv Preprint, arXiv:1807.03748.
[28] Sohn K. Improved Deep Metric Learning with Multi-class n-Pair Loss Objective[C]// Proceedings of the 30th International Conference on Neural Information Processing Systems. 2016: 1857-1865.
[29] Yan Y M, Li R M, Wang S R, et al. ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer[C]// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1:Long Papers). 2021: 5065-5075.
[30] Zhang Junlin. How to Use Noisy Data: The Application of Contrastive Learning in Weibo Scenarios[EB/OL]. [2023-03-06]. https://mp.weixin.qq.com/s/9N2tk6QTCTuBkrU5Xwb2ow.
[31] Gao T Y, Yao X C, Chen D Q. SimCSE: Simple Contrastive Learning of Sentence Embeddings [OL]. arXiv Preprint, arXiv:2104.08821.
[32] Kobayashi S. Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations [OL]. arXiv Preprint, arXiv:1805.06201.
[33] Ma E. NLP Augmentation[EB/OL].[2023-03-12]. https://github.com/makcedward/nlpaug.
[34] Vaswani A, Shazeer N, Parmar N, et al. Attention is All You Need [OL]. arXiv Preprint, arXiv:1706.03762.
[35] Srivastava N, Hinton G, Krizhevsky A, et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting[J]. Journal of Machine Learning Research, 2014, 15(56): 1929-1958.
[36] Kingma D P, Ba J. Adam: A Method for Stochastic Optimization [OL]. arXiv Preprint, arXiv:1412.6980.
[37] Rakib M R H, Zeh N, Jankowska M, et al. Enhancement of Short Text Clustering by Iterative Classification [OL]. arXiv Preprint, arXiv: 2001.11631.
[38] Zhang X, LeCun Y. Text Understanding from Scratch [OL]. arXiv Preprint, arXiv:1502.01710.
[39] Xu J M, Xu B, Wang P, et al. Self-Taught Convolutional Neural Networks for Short Text Clustering[J]. Neural Networks, 2017, 88: 22-31.
pmid: 28157556
[40] Lang K. NewsWeeder: Learning to Filter Netnews[C]// Proceedings of the 12th International Conference on Machine Learning. 1995: 331-339.
[41] Cui Y M, Che W X, Liu T, et al. Pre-training with Whole Word Masking for Chinese BERT [OL]. arXiv Preprint, arXiv:1906.08101.
[42] Kuhn H W. The Hungarian Method for the Assignment Problem[J]. Naval Research Logistics Quarterly, 1955, 2(1-2): 83-97.
doi: 10.1002/nav.v2:1/2
[43] Shahnaz F, Berry M W, Pauca V P, et al. Document Clustering Using Nonnegative Matrix Factorization[J]. Information Processing & Management, 2006, 42(2): 373-386.
doi: 10.1016/j.ipm.2004.11.005
Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing 100190
Tel/Fax: (010) 82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn