Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (10): 9-19    DOI: 10.11925/infotech.2096-3467.2022.0038
Topic Clustering for Social Media Texts with Heterogeneous Graph Neural Networks
Feng Xiaodong, Hui Kangxin
School of Public Affairs and Administration, University of Electronic Science and Technology of China, Chengdu 611731, China
Abstract  

[Objective] This paper develops a topic clustering method that addresses the semantic sparsity and multiple interaction relationships of social media texts. [Methods] We modeled the multiple interaction relationships between social media users and online content as a heterogeneous information network. First, we used word embedding methods to obtain text representations as the initial input features. Then, we propagated and aggregated node representations with a heterogeneous graph neural network. Finally, we trained the model to obtain representations of the text nodes and clustered them without supervision to identify topics. [Results] On an English benchmark dataset, the model reached NMI values of 0.8372 for original posts and 0.8689 for comments, higher than those of traditional LDA and of direct clustering on word or text embedding vectors from Word2Vec, Doc2Vec, or GloVe. [Limitations] Owing to data limitations, we did not examine the social relationships among users or multimedia content online. [Conclusions] The proposed model effectively improves topic clustering for social media texts.
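
The propagate-and-aggregate step described in the abstract can be illustrated with a minimal sketch. This is not the authors' HGNN-Topic model: the toy graph, the node names, the relation names (`replies_to`, `writes`), and the untrained relation-wise mean aggregation are all assumptions made for illustration. A real heterogeneous graph neural network would use learned weights and several layers.

```python
from collections import defaultdict

# Hypothetical toy network: node names, features, and relations are illustrative only.
features = {
    "link1": [1.0, 0.0],      # original post, e.g. an embedding of its text
    "comment1": [0.8, 0.2],
    "comment2": [0.0, 1.0],
    "user1": [0.0, 0.0],      # users carry no text, so they start from zeros
}
edges = [
    ("comment1", "replies_to", "link1"),
    ("comment2", "replies_to", "link1"),
    ("user1", "writes", "comment1"),
    ("user1", "writes", "link1"),
]


def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def aggregate(features, edges):
    """One round of relation-wise mean aggregation, treating edges as undirected."""
    # neighbors[node][relation] -> feature vectors of neighbors under that relation
    neighbors = defaultdict(lambda: defaultdict(list))
    for src, relation, dst in edges:
        neighbors[dst][relation].append(features[src])
        neighbors[src][relation].append(features[dst])
    updated = {}
    for node, own in features.items():
        # One mean message per relation type, then average with the node's own vector.
        messages = [mean(vectors) for vectors in neighbors[node].values()]
        updated[node] = mean([own] + messages)
    return updated


hidden = aggregate(features, edges)
for node, vector in hidden.items():
    print(node, [round(x, 3) for x in vector])
```

After one round, `comment2` moves toward the post it replies to, and `user1` acquires a non-zero vector from the texts it wrote. The resulting text-node vectors could then be passed to any unsupervised clustering algorithm.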

Keywords: Social Media; Topic Clustering; Multiple Interactions; Heterogeneous Information Network; Graph Neural Networks
Received: 13 January 2022      Published: 16 November 2022
ZTFLH:  TP391 G35  
Fund: Humanities and Social Sciences Foundation of the Ministry of Education, China (20YJAZH027); National Natural Science Foundation of China (72004021)
Corresponding Author: Feng Xiaodong, ORCID: 0000-0001-9975-9807, E-mail: fengxd1988@hotmail.com

Cite this article:

Feng Xiaodong, Hui Kangxin. Topic Clustering for Social Media Texts with Heterogeneous Graph Neural Networks. Data Analysis and Knowledge Discovery, 2022, 6(10): 9-19.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.0038     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I10/9

Research Framework of Topic Clustering of Social Media Text
An Example of the Heterogeneous Networks
Distribution of Length of Comments and Number of Comments
Subreddit (topic)  Links   Users  Comments  Top-level comments  Child comments
movies               359   7,647    10,902               6,858           4,044
news                 268   9,224    14,059               6,571           7,488
NFL                  219   3,395     5,029               3,136           1,893
pcmasterrace         363   4,569     6,384               4,569           1,815
relationships        281   7,186    17,234              11,422           5,812
Statistics of Reddit Dataset
Change of Clustering Index over Iterations
Clustering method              Link NMI  Link ARI  Comment NMI  Comment ARI
LDA                              0.1233    0.0820       0.0107       0.0006
Word2Vec                         0.6098    0.5894       0.3737       0.3718
GloVe                            0.5493    0.4088       0.3176       0.3119
Doc2Vec                          0.7461    0.7867       0.0558       0.0394
HGNN-Topic (Word2Vec input)      0.6595    0.6089       0.8668       0.9175
HGNN-Topic (GloVe input)         0.6145    0.4924       0.8684       0.9165
HGNN-Topic (Doc2Vec input)       0.8372    0.8520       0.8689       0.9181
Results of Different Clustering Methods
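
For readers unfamiliar with the two indices reported in the table, the following sketch computes NMI and ARI from a pair of label lists. It assumes the arithmetic-mean normalization for NMI (one of several common variants; the variant used in the paper is not stated on this page) and the standard pair-counting form of ARI. The function names and example labels are illustrative.

```python
from collections import Counter
from math import comb, log


def nmi(truth, pred):
    """Normalized mutual information, normalized by the mean of the two entropies."""
    n = len(truth)
    joint = Counter(zip(truth, pred))
    count_t, count_p = Counter(truth), Counter(pred)
    mutual_info = sum(
        c / n * log(c * n / (count_t[t] * count_p[p])) for (t, p), c in joint.items()
    )
    entropy_t = -sum(c / n * log(c / n) for c in count_t.values())
    entropy_p = -sum(c / n * log(c / n) for c in count_p.values())
    denom = (entropy_t + entropy_p) / 2
    # If both labelings consist of a single cluster, treat them as identical.
    return mutual_info / denom if denom > 0 else 1.0


def ari(truth, pred):
    """Adjusted Rand index via pair counting; assumes at least two samples."""
    n = len(truth)
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    pairs_t = sum(comb(c, 2) for c in Counter(truth).values())
    pairs_p = sum(comb(c, 2) for c in Counter(pred).values())
    expected = pairs_t * pairs_p / comb(n, 2)
    max_index = (pairs_t + pairs_p) / 2
    if max_index == expected:
        return 1.0
    return (pairs_joint - expected) / (max_index - expected)


# Hypothetical labels: five subreddits would give five true classes; two are used here.
truth = [0, 0, 1, 1]
same_partition = [1, 1, 0, 0]   # identical grouping, different label names
crossed = [0, 1, 0, 1]          # every predicted cluster mixes both classes

print(nmi(truth, same_partition), ari(truth, same_partition))
print(nmi(truth, crossed), ari(truth, crossed))
```

Both indices ignore the names of the cluster labels and compare only the groupings, so a relabeled but identical partition scores 1. ARI can fall below zero when agreement is worse than chance, whereas NMI is bounded below by 0.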
Word-Cloud Figure of Different Topics on Dataset from ScienceNet.cn