Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (10): 9-19     https://doi.org/10.11925/infotech.2096-3467.2022.0038
Research Paper
Topic Clustering for Social Media Texts with Heterogeneous Graph Neural Networks*
Feng Xiaodong, Hui Kangxin
School of Public Affairs and Administration, University of Electronic Science and Technology of China, Chengdu 611731, China
Abstract

[Objective] This paper explores an effective topic clustering method that addresses two problems of social media text data: semantic sparsity and the multiple interactions among the entities involved. [Methods] We model the multiple interaction relationships between social media users and content as a heterogeneous information network. Word embedding methods provide vector representations of the texts as initial input features. A heterogeneous graph neural network then propagates and aggregates information across nodes to learn text representation vectors, and an unsupervised clustering algorithm groups these vectors into topics. [Results] On an English benchmark social media data set, the clustering metric (NMI) for posts and comments reached 0.8372 and 0.8689 respectively, higher than the traditional LDA topic model and higher than direct clustering on word or text embedding vectors from Word2Vec, Doc2Vec, or GloVe. [Limitations] Owing to data limitations, the model does not cover social relationships among users or the multimedia content of posts. [Conclusions] By modeling the multiple interaction relationships in social media, the proposed method effectively improves topic clustering of social media texts.

Key words: Social Media; Topic Clustering; Multiple Interactions; Heterogeneous Information Network; Graph Neural Networks
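The pipeline in the [Methods] part of the abstract (typed network, initial text vectors, propagation and aggregation, unsupervised clustering) can be sketched on a toy example. This is only an illustration under stated assumptions, not the authors' model: the node types (`user`, `post`, `comment`), the edges, and the 2-dimensional vectors standing in for word embeddings are all hypothetical, the untrained per-type mean aggregator replaces a learned heterogeneous graph neural network, and a small k-means stands in for the unspecified clustering algorithm.

```python
from collections import defaultdict

# Toy heterogeneous network (hypothetical): node id -> node type.
node_type = {
    "u1": "user", "u2": "user",
    "p1": "post", "p2": "post",
    "c1": "comment", "c2": "comment",
}
# Undirected interactions: a user writes a post or comment, a comment replies to a post.
edges = [("u1", "p1"), ("u1", "c1"), ("c1", "p1"),
         ("u2", "p2"), ("u2", "c2"), ("c2", "p2")]

# Initial features: made-up stand-ins for text embeddings; users start at zero.
features = {
    "p1": [1.0, 0.0], "c1": [0.8, 0.2],
    "p2": [0.0, 1.0], "c2": [0.1, 0.9],
    "u1": [0.0, 0.0], "u2": [0.0, 0.0],
}

def mean(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def propagate(features, edges, node_type):
    """One message-passing round: average neighbors within each node type,
    then average those per-type summaries together with the node's own vector."""
    neighbors = defaultdict(lambda: defaultdict(list))
    for a, b in edges:
        neighbors[a][node_type[b]].append(features[b])
        neighbors[b][node_type[a]].append(features[a])
    updated = {}
    for node, own in features.items():
        per_type = [mean(vs) for vs in neighbors[node].values()]
        updated[node] = mean([own] + per_type)
    return updated

def kmeans(points, centers, rounds=10):
    """Plain k-means; returns one cluster index per point."""
    labels = []
    for _ in range(rounds):
        labels = [
            min(range(len(centers)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(p, centers[k])))
            for p in points
        ]
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = mean(members)
    return labels

hidden = propagate(features, edges, node_type)
text_nodes = [n for n, t in node_type.items() if t in ("post", "comment")]
labels = kmeans([hidden[n] for n in text_nodes],
                centers=[list(hidden["p1"]), list(hidden["p2"])])
clusters = dict(zip(text_nodes, labels))
```

In this toy setting each comment ends up in the same cluster as the post it replies to, because the aggregation step pulls connected text nodes toward a shared representation. A trained model would learn type-specific transformations rather than use plain averages.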
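Received: 2022-01-13      Published: 2022-11-16
CLC number:  TP391 G35
Funding: General Project of the Humanities and Social Sciences Fund of the Ministry of Education (20YJAZH027); Young Scientists Fund of the National Natural Science Foundation of China (72004021)
Corresponding author: Feng Xiaodong, ORCID: 0000-0001-9975-9807      E-mail: fengxd1988@hotmail.com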
Cite this article:
Feng Xiaodong, Hui Kangxin. Topic Clustering for Social Media Texts with Heterogeneous Graph Neural Networks. Data Analysis and Knowledge Discovery, 2022, 6(10): 9-19.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.0038      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I10/9
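Fig.1  Research framework for topic clustering of social media texts
Fig.2  Example of a heterogeneous information network
Fig.3  Distribution of comment length and of the number of comments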
Subreddit (topic) | Links | Users | Comments | First-level comments | Sub-comments
movies | 359 | 7,647 | 10,902 | 6,858 | 4,044
news | 268 | 9,224 | 14,059 | 6,571 | 7,488
NFL | 219 | 3,395 | 5,029 | 3,136 | 1,893
pcmasterrace | 363 | 4,569 | 6,384 | 4,569 | 1,815
relationships | 281 | 7,186 | 17,234 | 11,422 | 5,812
Table 1  Statistics of the Reddit data set
Fig.4  Clustering metrics versus number of iterations
Clustering method | link NMI | link ARI | comment NMI | comment ARI
LDA | 0.1233 | 0.0820 | 0.0107 | 0.0006
Word2Vec | 0.6098 | 0.5894 | 0.3737 | 0.3718
GloVe | 0.5493 | 0.4088 | 0.3176 | 0.3119
Doc2Vec | 0.7461 | 0.7867 | 0.0558 | 0.0394
HGNN-Topic (Word2Vec input) | 0.6595 | 0.6089 | 0.8668 | 0.9175
HGNN-Topic (GloVe input) | 0.6145 | 0.4924 | 0.8684 | 0.9165
HGNN-Topic (Doc2Vec input) | 0.8372 | 0.8520 | 0.8689 | 0.9181
Table 2  Comparison of clustering results of different methods
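For readers unfamiliar with the two metrics in Table 2, the following is a minimal pure-Python sketch of normalized mutual information (NMI) and the adjusted Rand index (ARI). The page does not state which NMI normalization the authors used, so the arithmetic mean of the two entropies is an assumption here, and the subreddit labels in the example are illustrative only.

```python
from collections import Counter
from math import comb, log

def nmi(truth, pred):
    """Mutual information divided by the arithmetic mean of the two label
    entropies (the choice of normalization is an assumption)."""
    n = len(truth)
    count_t, count_p = Counter(truth), Counter(pred)
    joint = Counter(zip(truth, pred))
    mutual = sum((c / n) * log(c * n / (count_t[t] * count_p[p]))
                 for (t, p), c in joint.items())

    def entropy(counts):
        return -sum((c / n) * log(c / n) for c in counts.values())

    denom = (entropy(count_t) + entropy(count_p)) / 2
    return mutual / denom if denom > 0 else 1.0

def ari(truth, pred):
    """Adjusted Rand index: pair-counting agreement corrected for chance."""
    n = len(truth)
    if n < 2:
        return 1.0
    joint = Counter(zip(truth, pred))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_t = sum(comb(c, 2) for c in Counter(truth).values())
    sum_p = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_t * sum_p / comb(n, 2)
    max_index = (sum_t + sum_p) / 2
    if max_index == expected:
        return 1.0
    return (sum_joint - expected) / (max_index - expected)

# Illustrative labels: six texts from three subreddits.
truth = ["movies", "movies", "news", "news", "NFL", "NFL"]
perfect = [0, 0, 1, 1, 2, 2]   # same grouping, different label names
noisy = [0, 0, 1, 2, 2, 2]     # one "news" text placed with "NFL"

nmi_perfect, ari_perfect = nmi(truth, perfect), ari(truth, perfect)
nmi_noisy, ari_noisy = nmi(truth, noisy), ari(truth, noisy)
```

Both metrics equal 1 when the predicted grouping matches the reference topics exactly, regardless of how clusters are numbered, and both fall below 1 as texts are misassigned. ARI is additionally corrected so that a random assignment scores about 0.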
Fig.5  Word clouds of the topics obtained by clustering ScienceNet (科学网) texts