[Objective] This paper develops an effective topic clustering method that addresses the semantic sparsity of social media texts and the multiple interaction relationships among them. [Methods] We modeled the multiple interaction relationships between social media users and online content as a heterogeneous information network. First, we used word embeddings to obtain text representations as the initial input features. Then, we propagated and aggregated node representations with a heterogeneous graph neural network. Finally, we trained the model on the text-node representations and performed unsupervised clustering of the topics. [Results] On an English benchmark dataset, the model's normalized mutual information (NMI) reached 0.8372 for original posts and 0.8689 for comments, higher than that of traditional LDA and of direct clustering on word or text embedding vectors from Word2Vec, Doc2Vec, or GloVe. [Limitations] Owing to data limitations, we did not examine the social relationships among users or online multimedia content. [Conclusions] The proposed model can effectively improve topic clustering for social media texts.
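The results above are reported as NMI, so a minimal sketch of how that score can be computed from two label assignments may be useful. The abstract does not say which normalization the authors used; this sketch assumes the arithmetic-mean variant, NMI = 2·I(Y; C) / (H(Y) + H(C)). The function names and the toy labels are hypothetical illustrations, not the paper's code or data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(true_labels, pred_labels):
    """Mutual information between two label assignments of equal length."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)
    mi = 0.0
    for (t, p), c in joint.items():
        # p(t, p) * log( p(t, p) / (p(t) * p(p)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (true_counts[t] * pred_counts[p]))
    return mi


def nmi(true_labels, pred_labels):
    """NMI with arithmetic-mean normalization (an assumption, see lead-in)."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")
    h_true = entropy(true_labels)
    h_pred = entropy(pred_labels)
    if h_true == 0.0 and h_pred == 0.0:
        # Both assignments put everything in one group: treat as identical.
        return 1.0
    return 2.0 * mutual_information(true_labels, pred_labels) / (h_true + h_pred)


if __name__ == "__main__":
    # Toy labels only: ground-truth topics vs. cluster ids for six texts.
    topics = ["a", "a", "a", "b", "b", "b"]
    clusters = [0, 0, 1, 1, 1, 1]
    print(f"NMI = {nmi(topics, clusters):.4f}")
```

NMI is invariant to how cluster ids are named, which is why it suits unsupervised topic clustering: a clustering that swaps the ids of two topics still scores the same as one that does not.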
Feng Xiaodong, Hui Kangxin. Topic Clustering for Social Media Texts with Heterogeneous Graph Neural Networks[J]. Data Analysis and Knowledge Discovery, 2022, 6(10): 9-19.