Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (11): 26-36    DOI: 10.11925/infotech.2096-3467.2022.1250
Online Sensitive Text Classification Model Based on Heterogeneous Graph Convolutional Network
Gao Haoxin1,2, Sun Lijuan1,3, Wu Jingchen1,4, Gao Yutong6, Wu Xu1,2,5
1Key Laboratory of Trustworthy Distributed Computing and Service, Beijing University of Posts and Telecommunications, Beijing 100876, China
2School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China
3School of Economics and Management, Beijing University of Posts and Telecommunications, Beijing 100876, China
4School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China
5Beijing University of Posts and Telecommunications Library, Beijing 100876, China
6School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
Abstract  

[Objective] This paper proposes a graph neural network-based model for classifying sensitive texts in online communities, in support of public opinion governance and information security. [Methods] First, we constructed a heterogeneous graph from texts, words, and sensitive entities, which incorporates existing knowledge about sensitive information in online public opinion. Second, we used BERT and GCN to capture the high-level semantic information of the text and its global co-occurrence features, respectively. Third, we combined the complementary strengths of pre-trained and graph models to address the heterogeneity caused by structural differences between long and short texts. Finally, we classified sensitive texts based on the features of online public opinion. [Results] We evaluated the proposed model on a self-built dataset of sensitive online public opinion texts. It reached an accuracy of 70.80%, which is 3.52 percentage points higher than the strongest baseline (BERT, 67.28%). [Limitations] Large heterogeneous graphs built on long texts reduce computing speed. [Conclusions] The proposed model can effectively identify and classify sensitive content in different kinds of online texts.
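
The graph side of the method can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`build_hetero_graph`, `normalize`, `gcn_layer`), the toy documents, the unit edge weights, and the toy weight matrix are all assumptions made for illustration, and the BERT branch is omitted. The only part taken from the literature is the standard GCN propagation rule of Kipf and Welling, ReLU(D^{-1/2} (A + I) D^{-1/2} H W).

```python
import math

def build_hetero_graph(docs, entities):
    """Build a symmetric adjacency matrix over document, word, and entity nodes.

    Assumed edge rule (illustrative only): a document is linked with weight 1
    to every word or sensitive entity it contains.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    # Node order: documents first, then every distinct token.
    nodes = [("doc", i) for i in range(len(docs))]
    nodes += [("entity" if tok in entities else "word", tok) for tok in vocab]
    index = {node: k for k, node in enumerate(nodes)}
    n = len(nodes)
    adj = [[0.0] * n for _ in range(n)]
    for i, doc in enumerate(docs):
        for tok in set(doc):
            kind = "entity" if tok in entities else "word"
            j = index[(kind, tok)]
            adj[i][j] = adj[j][i] = 1.0
    return nodes, adj

def normalize(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(a_norm, features, weights):
    """One propagation step: ReLU(A_norm @ H @ W)."""
    hidden = matmul(matmul(a_norm, features), weights)
    return [[max(0.0, v) for v in row] for row in hidden]

# Toy corpus (hypothetical): three tokenized texts and two sensitive entities.
docs = [["storm", "canada", "power"], ["shooting", "texas", "school"], ["storm", "school"]]
entities = {"canada", "texas"}
nodes, adj = build_hetero_graph(docs, entities)
a_norm = normalize(adj)

# One-hot input features and a fixed toy weight matrix with 2 output columns.
n = len(nodes)
features = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
weights = [[((i + j) % 3 - 1) * 0.5 for j in range(2)] for i in range(n)]
hidden = gcn_layer(a_norm, features, weights)
print(len(nodes), "nodes ->", len(hidden[0]), "hidden features per node")
```

In a full model, the document-node rows of `hidden` would feed a classifier, and the document features could come from a pre-trained encoder rather than one-hot vectors; both choices are outside what the abstract specifies.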

Key words: Graph Convolutional Network; Sensitive Text Classification; Heterogeneous Graph; BERT
Received: 23 November 2022      Published: 22 March 2023
CLC Number: TP183; G350
Fund: National Natural Science Foundation of China (72293583); China Postdoctoral Science Foundation (2022M710463)
Corresponding Author: Wu Xu, ORCID: 0000-0002-1297-2726, E-mail: wux@bupt.edu.cn

Cite this article:

Gao Haoxin, Sun Lijuan, Wu Jingchen, Gao Yutong, Wu Xu. Online Sensitive Text Classification Model Based on Heterogeneous Graph Convolutional Network. Data Analysis and Knowledge Discovery, 2023, 7(11): 26-36.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.1250     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2023/V7/I11/26

STC-HGCN Model
An Example of Heterogeneous Graph
Schematic of GCN for Sensitive Text
Classification Scheme of Public Opinion Sensitive Texts in the Field of Education
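Category | Text | Source
Safety | Details of the Texas shooting disclosed: the gunman locked himself inside a classroom and opened fire, leaving the children no way to escape. | Hupu
Management | Of course it is not just a matter of listening to Ye Caide's opinion; he is only a representative of community epidemic prevention and control who happened to be pushed to the forefront… | Shuimu Community
Reputation | There are reasons why Huawei puts people off. Which app cannot run without Huawei? When tariffs fall, nobody thanks the carriers… | Shuimu Community
Academic | Solid research methods are certainly important and do indicate a high academic level. But a high academic level… | Hupu
Disaster | Storm sweeps eastern Canada, killing 8: many utility poles snapped and 500,000 people lost power. | Huanqiu.com
Political | The Ministry of Foreign Affairs should state directly and clearly that the current Ukrainian government is… | Guancha.cn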
Examples of Sensitive Text
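Category | Count | Average length | Maximum length | Minimum length
Safety | 386 | 438 | 6,279 | 12
Management | 7,954 | 997 | 24,875 | 10
Reputation | 2,250 | 831 | 32,767 | 11
Academic | 2,323 | 1,098 | 28,442 | 15
Disaster | 82 | 471 | 5,141 | 22
Political | 709 | 709 | 32,767 | 8
All | 13,704 | 890 | 32,767 | 8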
Dataset Information
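
The class counts in the table above are strongly skewed, which a few lines of arithmetic make concrete. The sketch below only restates the counts from the table; the English category labels are translations, and the variable names are chosen for illustration.

```python
# Sample counts per category, copied from the dataset table.
counts = {
    "Safety": 386,
    "Management": 7954,
    "Reputation": 2250,
    "Academic": 2323,
    "Disaster": 82,
    "Political": 709,
}

total = sum(counts.values())
# Fraction of the dataset that falls in each category.
shares = {name: n / total for name, n in counts.items()}
# Ratio between the largest and the smallest category.
imbalance_ratio = max(counts.values()) / min(counts.values())

for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<11}{counts[name]:>6}  {share:6.2%}")
print("total:", total, "| largest/smallest:", imbalance_ratio)
```

The Management category alone accounts for more than half of the samples, while Disaster has fewer than one hundred, so weighted-average metrics are dominated by the larger categories.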
Model | Accuracy/% | Weighted-average precision/% | Weighted-average recall/% | Weighted-average F1/%
TextRNN | 60.83 | 62.34 | 60.83 | 56.09
FastText | 63.73 | 63.51 | 63.73 | 60.95
TextRNN-Att | 64.58 | 63.11 | 64.58 | 62.76
DPCNN | 64.72 | 64.90 | 64.72 | 62.55
TextCNN | 66.15 | 64.63 | 65.15 | 63.08
TextRCNN | 66.29 | 65.40 | 66.29 | 64.50
BERT | 67.28 | 68.28 | 67.28 | 66.22
STC-HGCN | 70.80 | 71.10 | 70.80 | 70.23
Results of Comparative Experiments
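
The table reports weighted-average precision, recall, and F1. As a hedged illustration of what those columns mean, the sketch below computes them from scratch on a made-up set of labels; the function name `weighted_metrics` and the toy labels are assumptions, and the authors' evaluation code is not shown in the source. Each per-class score is weighted by that class's share of the true labels, which is why weighted-average recall coincides with accuracy.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = support[label]
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / actual if actual else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = actual / n  # share of true samples in this class
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return accuracy, precision, recall, f1

# Hypothetical labels for ten texts in three categories.
y_true = ["management"] * 5 + ["academic"] * 3 + ["safety"] * 2
y_pred = ["management"] * 4 + ["academic"] * 3 + ["management", "safety", "management"]

acc, prec, rec, f1 = weighted_metrics(y_true, y_pred)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```

Because the weights are class supports, the weighted recall sum collapses to the total number of correct predictions divided by the number of samples, which is the accuracy.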
Model | Accuracy/%
① Entity nodes removed | 68.42
② GCN removed | 67.28
③ BERT removed | 68.64
Full model | 70.80
Results of Ablation Study
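
The contribution of each component can be read off the ablation table as a drop in accuracy relative to the full model. The short sketch below does that subtraction; the values are copied from the table and the dictionary keys are English paraphrases of the row labels.

```python
# Accuracy values (%) copied from the ablation table.
full_model = 70.80
ablations = {
    "remove entity nodes": 68.42,
    "remove GCN": 67.28,
    "remove BERT": 68.64,
}

# Drop in percentage points when each component is removed.
drops = {name: round(full_model - acc, 2) for name, acc in ablations.items()}
largest = max(drops, key=drops.get)

for name, drop in sorted(drops.items(), key=lambda kv: -kv[1]):
    print(f"{name}: -{drop:.2f} points")
print("largest drop:", largest)
```

Removing the GCN gives 67.28%, the same accuracy as the BERT baseline in the comparison table, so the resulting 3.52-point drop matches the margin reported in the abstract.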