Online Sensitive Text Classification Model Based on Heterogeneous Graph Convolutional Network<sup>*</sup>

Data Analysis and Knowledge Discovery, 2023, Vol. 7, Issue (11): 26-36

https://doi.org/10.11925/infotech.2096-3467.2022.1250

Research Paper

Gao Haoxin<sup>1,2</sup>, Sun Lijuan<sup>1,3</sup>, Wu Jingchen<sup>1,4</sup>, Gao Yutong<sup>6</sup>, Wu Xu<sup>1,2,5</sup>

¹Key Laboratory of Trustworthy Distributed Computing and Service, Beijing University of Posts and Telecommunications, Beijing 100876, China
²School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China
³School of Economics and Management, Beijing University of Posts and Telecommunications, Beijing 100876, China
⁴School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China
⁵Beijing University of Posts and Telecommunications Library, Beijing 100876, China
⁶School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China


Abstract

[Objective] This paper designs a classification model, based on graph neural networks, for sensitive texts in online communities, to support the governance of online public opinion and the information security of online communities. [Methods] First, we construct a heterogeneous graph by adding sensitive entities to the text and word nodes, which introduces prior knowledge about sensitive information in online public opinion. Second, we use BERT to capture the deep semantic information of each text and a graph convolutional network (GCN) to capture global co-occurrence features. Third, we combine the two to obtain the complementary strengths of pre-trained models and graph models and to adapt to the structural differences between long and short texts. Finally, texts are classified according to a sensitive text classification system designed around the characteristics of public opinion in online communities. [Results] In extensive experiments on a self-built dataset of sensitive online public opinion texts, the proposed model reaches an accuracy of 70.80%, at least 3.52 percentage points higher than the baseline models. [Limitations] A heterogeneous graph built on a large corpus becomes very large, which slows computation. [Conclusions] The proposed model adapts to the structural differences among sensitive texts in online communities and captures their sensitive features more effectively, which improves classification performance.

Key words: Graph Convolutional Network; Sensitive Text Classification; Heterogeneous Graph; BERT
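The pipeline summarized in the abstract (a heterogeneous graph over texts, words, and sensitive entities, a GCN over that graph, and a combination with BERT outputs) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy corpus, the `sensitive_entities` lexicon, the edge weights (term frequency and raw co-occurrence counts), the two-class output, the random stand-in for BERT probabilities, and the linear interpolation with `lam` as the fusion step. The weights are untrained, so the printed labels carry no meaning.

```python
import math
import random

# Toy corpus and a hypothetical sensitive-entity lexicon (entity -> trigger words).
texts = [
    ["exam", "leak", "school"],
    ["school", "canteen", "food"],
    ["exam", "score", "school"],
]
sensitive_entities = {"exam_fraud": {"leak"}, "food_safety": {"canteen", "food"}}

def build_heterogeneous_graph(texts, entities):
    """Return a symmetric adjacency matrix over text, word, and entity nodes."""
    vocab = sorted({tok for tokens in texts for tok in tokens})
    ent_names = sorted(entities)
    n_text = len(texts)
    word_idx = {w: n_text + i for i, w in enumerate(vocab)}
    ent_idx = {e: n_text + len(vocab) + i for i, e in enumerate(ent_names)}
    n = n_text + len(vocab) + len(ent_names)
    adj = [[0.0] * n for _ in range(n)]
    for t, tokens in enumerate(texts):
        uniq = sorted(set(tokens))
        for w in uniq:  # text-word edge, weighted by term frequency (assumption)
            weight = tokens.count(w) / len(tokens)
            adj[t][word_idx[w]] = adj[word_idx[w]][t] = weight
        for i, a in enumerate(uniq):  # word-word edge, raw co-occurrence count
            for b in uniq[i + 1:]:
                adj[word_idx[a]][word_idx[b]] += 1.0
                adj[word_idx[b]][word_idx[a]] += 1.0
        for e in ent_names:  # text-entity edge when a trigger word appears
            if entities[e] & set(tokens):
                adj[t][ent_idx[e]] = adj[ent_idx[e]][t] = 1.0
    return adj, n_text

def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 used by standard GCN layers."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    d = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    return [[d[i] * a_hat[i][j] * d[j] for j in range(n)] for i in range(n)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(v, 0.0) for v in row] for row in m]

def softmax_rows(m):
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([v / total for v in exps])
    return out

rng = random.Random(0)
adj, n_text = build_heterogeneous_graph(texts, sensitive_entities)
a_norm = normalize_adjacency(adj)
n, hidden_dim, n_classes = len(adj), 8, 2
w1 = [[rng.gauss(0, 0.1) for _ in range(hidden_dim)] for _ in range(n)]
w2 = [[rng.gauss(0, 0.1) for _ in range(n_classes)] for _ in range(hidden_dim)]

# Two GCN layers with one-hot node features (X = I, so A_norm X W1 = A_norm W1).
hidden = relu(matmul(a_norm, w1))
gcn_probs = softmax_rows(matmul(a_norm, matmul(hidden, w2)))[:n_text]

# Random stand-in for per-text BERT class probabilities (no real model is loaded).
bert_probs = softmax_rows([[rng.gauss(0, 1) for _ in range(n_classes)] for _ in range(n_text)])

# One possible fusion: linear interpolation of the two predictions (assumption).
lam = 0.5
fused = [[lam * g + (1 - lam) * b for g, b in zip(g_row, b_row)]
         for g_row, b_row in zip(gcn_probs, bert_probs)]
predictions = [row.index(max(row)) for row in fused]
print(predictions)
```

The dense adjacency matrix has one row and column per text, word, and entity node, so its size grows quadratically with the corpus and vocabulary. This is in line with the limitation noted in the abstract that graphs built on large corpora slow computation; a sparse representation would be the usual remedy.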

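Received: 2022-11-23 Published: 2023-03-22

CLC number: TP183; G350

Funding: *This work is supported by the Major Program of the National Natural Science Foundation of China (72293583) and the General Program of the China Postdoctoral Science Foundation (2022M710463).

Corresponding author: Wu Xu, ORCID: 0000-0002-1297-2726, E-mail: wux@bupt.edu.cn.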

Cite this article:

Gao Haoxin, Sun Lijuan, Wu Jingchen, Gao Yutong, Wu Xu. Online Sensitive Text Classification Model Based on Heterogeneous Graph Convolutional Network[J]. Data Analysis and Knowledge Discovery, 2023, 7(11): 26-36.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.1250 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2023/V7/I11/26

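Fig.1 Architecture of the STC-HGCN model

Fig.2 Example of a heterogeneous graph

Fig.3 Schematic of the graph convolutional network for sensitive texts in online communities

Fig.4 Classification system for sensitive public opinion texts in the education domain

Table 1 Examples from the sensitive text dataset

Table 2 Dataset information

Table 3 Results of the comparison experiments

Table 4 Results of the ablation study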

