Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (10): 9-19    DOI: 10.11925/infotech.2096-3467.2022.0038
Topic Clustering for Social Media Texts with Heterogeneous Graph Neural Networks
Feng Xiaodong, Hui Kangxin
School of Public Affairs and Administration, University of Electronic Science and Technology of China, Chengdu 611731, China
[Objective] This paper develops a topic clustering method that addresses the semantic sparsity and multiple interaction relationships of social media texts. [Methods] We modeled the multiple interactions between social media users and online content as a heterogeneous information network. First, we used word embedding methods to obtain text representations as the initial input features. Then, we propagated and aggregated node representations with a heterogeneous graph neural network. Finally, we trained the model on the representations of text nodes and performed unsupervised topic clustering. [Results] On an English benchmark dataset, the model reached an NMI of 0.8372 for original posts and 0.8689 for comments, higher than traditional LDA and than direct clustering on word or text embedding vectors from Word2Vec, Doc2Vec, or GloVe. [Limitations] Owing to data limitations, we did not examine the social relationships among users or online multimedia content. [Conclusions] The proposed model effectively improves topic clustering for social media texts.
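
The propagation step in [Methods] can be illustrated with a minimal sketch. It is not the authors' model: it assumes plain mean aggregation per relation type with no learned weights, and the node names, relation names, and zero vector for the user node are hypothetical.

```python
from collections import defaultdict


def propagate(features, edges):
    """One round of mean aggregation over typed edges (illustrative only).

    features: dict mapping node id -> list of floats
    edges: list of (source, relation, target); messages flow source -> target
    """
    # Group the incoming neighbours of each node by relation type.
    incoming = defaultdict(lambda: defaultdict(list))
    for src, rel, dst in edges:
        incoming[dst][rel].append(src)

    updated = {}
    for node, own in features.items():
        dim = len(own)
        # Keep the node's own vector, then add one mean vector per relation.
        parts = [own]
        for nbrs in incoming[node].values():
            parts.append(
                [sum(features[n][i] for n in nbrs) / len(nbrs) for i in range(dim)]
            )
        # Average the node's own vector with the per-relation means.
        updated[node] = [sum(p[i] for p in parts) / len(parts) for i in range(dim)]
    return updated


# Hypothetical toy network: one post (link), one comment, one user.
features = {
    "link1": [1.0, 0.0],     # e.g. a text embedding of the post
    "comment1": [0.0, 1.0],  # e.g. a text embedding of the comment
    "user1": [0.0, 0.0],     # assumption: users start without text features
}
edges = [
    ("comment1", "replies_to", "link1"),
    ("user1", "posts", "link1"),
    ("link1", "has_comment", "comment1"),
    ("user1", "writes", "comment1"),
]
out = propagate(features, edges)
```

After one round, each text node mixes its own embedding with information from its neighbours of each type. The updated text-node vectors could then be passed to any standard clustering algorithm.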

Key words: Social Media; Topic Clustering; Multiple Interactions; Heterogeneous Information Network; Graph Neural Networks
Received: 13 January 2022      Published: 16 November 2022
ZTFLH:  TP391 G35  
Fund: Humanities and Social Sciences Foundation of the Ministry of Education, China (20YJAZH027); National Natural Science Foundation of China (72004021)
Corresponding Author: Feng Xiaodong, ORCID: 0000-0001-9975-9807

Feng Xiaodong, Hui Kangxin. Topic Clustering for Social Media Texts with Heterogeneous Graph Neural Networks. Data Analysis and Knowledge Discovery, 2022, 6(10): 9-19.

Research Framework of Topic Clustering of Social Media Text
An Example of the Heterogeneous Networks
Distribution of Length of Comments and Number of Comments
Subreddit (topic)   Links   Users   Comments   Top-level comments   Child comments
movies              359     7647    10902      6858                 4044
news                268     9224    14059      6571                 7488
NFL                 219     3395    5029       3136                 1893
pcmasterrace        363     4569    6384       4569                 1815
relationships       281     7186    17234      11422                5812
Statistics of Reddit Dataset
Change of Clustering Index over Iterations
Clustering method              link              comment
LDA                            0.1233  0.0820    0.0107  0.0006
Word2Vec                       0.6098  0.5894    0.3737  0.3718
GloVe                          0.5493  0.4088    0.3176  0.3119
Doc2Vec                        0.7461  0.7867    0.0558  0.0394
HGNN-Topic (Word2Vec input)    0.6595  0.6089    0.8668  0.9175
HGNN-Topic (GloVe input)       0.6145  0.4924    0.8684  0.9165
HGNN-Topic (Doc2Vec input)     0.8372  0.8520    0.8689  0.9181
Results of Different Clustering Methods
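
The abstract reports clustering quality as NMI (normalized mutual information). A minimal sketch of how NMI can be computed from true and predicted cluster labels follows. It assumes normalization by the arithmetic mean of the two entropies, which may differ from the variant used in the paper; the function name `nmi` is illustrative.

```python
import math
from collections import Counter


def nmi(labels_true, labels_pred):
    """Normalized mutual information between two labelings of the same items."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        # Shannon entropy of a labeling, using natural logarithms.
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # Mutual information between the two labelings.
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in count_joint.items()
    )

    # Assumption: normalize by the arithmetic mean of the two entropies.
    denom = (entropy(count_true) + entropy(count_pred)) / 2
    # Both labelings are a single cluster: treat them as identical.
    return mutual_info / denom if denom > 0 else 1.0


same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])  # same grouping, renamed labels
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])     # unrelated grouping
```

NMI ignores the names of the cluster labels and depends only on how items are grouped, which is why it suits unsupervised topic clustering, where cluster ids are arbitrary.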
Word-Cloud Figure of Different Topics on Dataset from
