Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (11): 26-36     https://doi.org/10.11925/infotech.2096-3467.2022.1250
Research Article
Online Sensitive Text Classification Model Based on Heterogeneous Graph Convolutional Network*
Gao Haoxin1,2, Sun Lijuan1,3, Wu Jingchen1,4, Gao Yutong6, Wu Xu1,2,5
1 Key Laboratory of Trustworthy Distributed Computing and Service, Beijing University of Posts and Telecommunications, Beijing 100876, China
2 School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China
3 School of Economics and Management, Beijing University of Posts and Telecommunications, Beijing 100876, China
4 School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China
5 Beijing University of Posts and Telecommunications Library, Beijing 100876, China
6 School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
Abstract

[Objective] To design a graph-neural-network-based model for classifying sensitive texts in online communities, in support of online public opinion governance and information security in online communities. [Methods] We construct a heterogeneous graph by adding sensitive entity nodes to the text and word nodes, which introduces prior knowledge about sensitive information in online public opinion. BERT captures the deep semantic information of each text, and a graph convolutional network (GCN) captures global co-occurrence features. Combining the two exploits the complementary strengths of pre-trained models and graph models and accommodates the structural differences between long and short texts. Finally, texts are classified according to a sensitive text classification system designed around the characteristics of public opinion in online communities. [Results] In extensive experiments on a self-built dataset of sensitive online public opinion texts, the proposed model reached an accuracy of 70.80%, at least 3.52 percentage points higher than the baseline models. [Limitations] A heterogeneous graph built on a large corpus becomes too large and slows computation. [Conclusions] The proposed model adapts to the structural differences among sensitive texts in online communities and captures sensitive features in the text more effectively, which improves classification performance on sensitive texts.

Key words: Graph Convolutional Network; Sensitive Text Classification; Heterogeneous Graph; BERT
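The abstract describes a graph with three node types (texts, words, and sensitive entities) processed by a GCN. The following is a minimal sketch of that idea, not the authors' implementation: the entity lexicon, the toy documents, the sliding-window co-occurrence rule, and the use of raw counts as edge weights are all assumptions made for illustration, since this page does not state how the paper weights its edges. The normalization is the standard symmetric form D^{-1/2}(A + I)D^{-1/2} used in GCNs.

```python
import math
from collections import defaultdict


def build_hetero_graph(docs, sensitive_entities, window=3):
    """Build a text-word-entity graph from tokenized documents.

    docs: list of token lists (already segmented).
    sensitive_entities: hypothetical lexicon of tokens treated as entity nodes.
    Returns (nodes, weights): nodes is a list of (kind, name) pairs and
    weights maps (i, j) index pairs to symmetric edge weights (raw counts,
    an assumption made for this sketch).
    """
    nodes, index = [], {}
    weights = defaultdict(float)

    def node_id(kind, name):
        key = (kind, name)
        if key not in index:
            index[key] = len(nodes)
            nodes.append(key)
        return index[key]

    def token_node(tok):
        kind = "entity" if tok in sensitive_entities else "word"
        return node_id(kind, tok)

    def add_edge(u, v, w=1.0):
        if u != v:
            weights[(u, v)] += w
            weights[(v, u)] += w

    for d, tokens in enumerate(docs):
        doc = node_id("text", d)
        # Text-token edges: one count per occurrence of the token in the text.
        for tok in tokens:
            add_edge(doc, token_node(tok))
        # Token-token edges: co-occurrence inside a sliding window.
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + window, len(tokens))):
                add_edge(token_node(tokens[i]), token_node(tokens[j]))
    return nodes, dict(weights)


def normalized_adjacency(n, weights):
    """Return D^{-1/2} (A + I) D^{-1/2} as a dense list of lists."""
    a = [[0.0] * n for _ in range(n)]
    for (u, v), w in weights.items():
        a[u][v] = w
    for i in range(n):
        a[i][i] += 1.0  # self-loops
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(a_hat, h, w):
    """One GCN layer: ReLU(a_hat @ h @ w), all dense lists of lists."""
    def matmul(x, y):
        return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)]
                for row in x]
    return [[max(0.0, v) for v in row] for row in matmul(matmul(a_hat, h), w)]


# Toy data (hypothetical): two short texts sharing the word "school".
docs = [["gunman", "school", "shooting"], ["storm", "school", "power"]]
nodes, weights = build_hetero_graph(docs, sensitive_entities={"gunman", "storm"})
a_hat = normalized_adjacency(len(nodes), weights)
```

In this toy graph the two text nodes are linked only through the shared word node, which is how a GCN layer lets information flow between documents. In a combined model, the text-node features passed to `gcn_layer` would typically come from a pre-trained encoder such as BERT rather than from one-hot vectors.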
Received: 2022-11-23      Published: 2023-03-22
CLC number (ZTFLH):  TP183; G350
Funding: *This work is one of the research outcomes of the Major Program of the National Natural Science Foundation of China (72293583) and the General Program of the China Postdoctoral Science Foundation (2022M710463).
Corresponding author: Wu Xu, ORCID: 0000-0002-1297-2726, E-mail: wux@bupt.edu.cn.
Cite this article:
Gao Haoxin, Sun Lijuan, Wu Jingchen, Gao Yutong, Wu Xu. Online Sensitive Text Classification Model Based on Heterogeneous Graph Convolutional Network[J]. Data Analysis and Knowledge Discovery, 2023, 7(11): 26-36.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.1250      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2023/V7/I11/26
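Fig.1  Architecture of the STC-HGCN model
Fig.2  Example of a heterogeneous graph
Fig.3  Schematic of the graph convolutional network for sensitive texts in online communities
Fig.4  Classification system for sensitive public opinion texts in the education domain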
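Category | Text | Source
Security | Details of the Texas shooting disclosed: the gunman locked himself inside a classroom and opened fire, leaving the children with no way out. | Hupu
Management | Of course it is not only a matter of listening to Ye Caide's opinion; he is just one representative of community prevention and control who happened to be pushed to the front of the stage… | Shuimu Community
Reputation | There are reasons people find Huawei off-putting. Which app cannot run without Huawei? When tariffs fall, nobody thanks the carriers… | Shuimu Community
Academic | Solid research methods are certainly important and do indicate a high academic level. But a high academic level… | Hupu
Disaster | Storm sweeps eastern Canada, killing 8: many utility poles snapped, 500,000 people left without power. | Huanqiu.com
Political | The Ministry of Foreign Affairs should state directly and clearly that the current Ukrainian government is… | Guancha.cn
Table 1  Examples from the sensitive text dataset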
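Category | Count | Average length | Maximum length | Minimum length
Security | 386 | 438 | 6 279 | 12
Management | 7 954 | 997 | 24 875 | 10
Reputation | 2 250 | 831 | 32 767 | 11
Academic | 2 323 | 1 098 | 28 442 | 15
Disaster | 82 | 471 | 5 141 | 22
Political | 709 | 709 | 32 767 | 8
All | 13 704 | 890 | 32 767 | 8
Table 2  Dataset statistics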
Model | Accuracy/% | Weighted-average precision/% | Weighted-average recall/% | Weighted-average F1/%
TextRNN | 60.83 | 62.34 | 60.83 | 56.09
FastText | 63.73 | 63.51 | 63.73 | 60.95
TextRNN-Att | 64.58 | 63.11 | 64.58 | 62.76
DPCNN | 64.72 | 64.90 | 64.72 | 62.55
TextCNN | 66.15 | 64.63 | 65.15 | 63.08
TextRCNN | 66.29 | 65.40 | 66.29 | 64.50
BERT | 67.28 | 68.28 | 67.28 | 66.22
STC-HGCN | 70.80 | 71.10 | 70.80 | 70.23
Table 3  Results of the comparison experiments
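Table 3 reports support-weighted averages of per-class precision, recall, and F1. The sketch below shows one common way to compute such averages, assuming each class is weighted by its share of the true labels; the function name and the toy labels are illustrative only, and the page does not show the authors' evaluation code. Under this definition weighted-average recall is algebraically equal to accuracy, which matches every row of Table 3 except TextCNN (66.15 vs. 65.15).

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1."""
    n = len(y_true)
    support = Counter(y_true)  # number of true samples per class
    precision = recall = f1 = 0.0
    for c, n_c in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted_c = sum(1 for p in y_pred if p == c)
        p_c = tp / predicted_c if predicted_c else 0.0
        r_c = tp / n_c
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = n_c / n  # class share of the true labels
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (hypothetical): three samples of class "a", one of class "b".
scores = weighted_metrics(["a", "a", "a", "b"], ["a", "a", "b", "b"])
```

With imbalanced classes such as those in Table 2, the weighted averages are dominated by the largest classes, so they can hide poor performance on small classes like the disaster category.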
Model | Accuracy/%
① Without entity nodes | 68.42
② Without GCN | 67.28
③ Without BERT | 68.64
Full model | 70.80
Table 4  Results of the ablation study
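The contribution of each component can be read off Table 4 as the accuracy lost when that component is removed. The short calculation below uses only the figures from the table; the variable names are illustrative.

```python
# Accuracy (%) from Table 4.
full_model = 70.80
ablations = {
    "without entity nodes": 68.42,
    "without GCN": 67.28,
    "without BERT": 68.64,
}

# Drop in percentage points relative to the full model.
drops = {name: round(full_model - acc, 2) for name, acc in ablations.items()}
largest = max(drops, key=drops.get)

for name, drop in drops.items():
    print(f"{name}: -{drop:.2f} percentage points")
```

Removing the GCN costs the most (3.52 percentage points) and leaves the same accuracy as the BERT baseline in Table 3 (67.28%), while removing the entity nodes or BERT costs 2.38 and 2.16 percentage points respectively.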