Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (9): 31-41    DOI: 10.11925/infotech.2096-3467.2021.0266
Construction and Application of GCN Model for Text Classification with Associated Information
Zhou Zeyu1,2, Wang Hao1,2, Zhao Zibo1,2, Li Yueyan1,2, Zhang Xiaoqin3
1School of Information Management, Nanjing University, Nanjing 210023, China
2Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
3Jinling Library, Nanjing 210023, China
Abstract  

[Objective] This paper tries to learn text contexts and the polysemy of words, aiming to improve the performance of automatic text classification. [Methods] We proposed a GCN model for long text classification with associated information. First, we used BERT to obtain the initial word vector features of the long texts. Then, we input these initial features into a BiLSTM model to capture their semantic relationships. Third, we represented the word features as nodes of the graph convolutional network SGCN. Fourth, we used the vector similarity between words as edges to connect the nodes and constructed the graph structure. Finally, we input the long-text representation from the SGCN into fully connected layers to finish the classification tasks. [Results] We evaluated our model on Chinese scientific literature covering multiple subjects. The accuracy of our model reached 0.834 09, which is better than the benchmark models. [Limitations] We only treated each text as having a single topic in the multi-class classification task. [Conclusions] The proposed model based on the BERT, BiLSTM and SGCN algorithms could effectively classify long texts.
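To make the pipeline in the abstract concrete, the following is a minimal PyTorch sketch of a BERT → BiLSTM → similarity-graph GCN classifier. All class names, layer sizes, and the mean-pooling step are illustrative assumptions made here for demonstration; only the overall flow (BERT word vectors → BiLSTM → graph convolution over a word-similarity graph → fully connected classifier over the 7 categories) follows the description above.

```python
# A minimal sketch of the BERT -> BiLSTM -> similarity-graph GCN pipeline
# described in the abstract. Names and hyperparameters are assumptions,
# not the authors' released code.
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One graph-convolution layer: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, a_hat, h):
        # a_hat: (num_words, num_words) normalized word-similarity adjacency
        # h:     (num_words, in_dim) word-node features
        return torch.relu(a_hat @ self.linear(h))

class BertBiLSTMSGCN(nn.Module):
    def __init__(self, bert_dim=768, lstm_dim=256, gcn_dim=128, num_classes=7):
        super().__init__()
        self.bilstm = nn.LSTM(bert_dim, lstm_dim, batch_first=True, bidirectional=True)
        self.gcn1 = SimpleGCNLayer(2 * lstm_dim, gcn_dim)
        self.gcn2 = SimpleGCNLayer(gcn_dim, gcn_dim)
        self.classifier = nn.Linear(gcn_dim, num_classes)

    def forward(self, bert_word_vectors, a_hat):
        # bert_word_vectors: (1, num_words, bert_dim) BERT features for one document
        contextual, _ = self.bilstm(bert_word_vectors)   # (1, num_words, 2*lstm_dim)
        nodes = contextual.squeeze(0)                    # word nodes of the graph
        h = self.gcn2(a_hat, self.gcn1(a_hat, nodes))    # two graph-convolution layers
        doc_vector = h.mean(dim=0)                       # pool word nodes into a document vector (assumed)
        return self.classifier(doc_vector)               # logits over the 7 categories
```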

Key words: Graph Convolutional Network; Deep Learning; BERT; Text Classification
Received: 16 March 2021      Published: 15 October 2021
CLC Number: G202
Fund: National Natural Science Foundation of China (72074108); 2020 Wuxi Association for Science and Technology Soft Science Research Project (KT-20-C058); Innovative Research Project for Doctoral Candidates of Nanjing University (CXYJ21-69)
Corresponding Authors: Wang Hao     E-mail: ywhaowang@nju.edu.cn

Cite this article:

Zhou Zeyu,Wang Hao,Zhao Zibo,Li Yueyan,Zhang Xiaoqin. Construction and Application of GCN Model for Text Classification with Associated Information. Data Analysis and Knowledge Discovery, 2021, 5(9): 31-41.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0266     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I9/31

The Text Multi-Classification Model Based on BERT-BiLSTM-SGCN
Classification Number | Category | Total | Training Set | Test Set
R181 | Basic theories and methods of epidemiology | 1 871 | 1 575 | 296
R183 | Prevention of infectious diseases | 1 281 | 1 158 | 123
R184 | Epidemic prevention measures and management | 1 909 | 1 723 | 186
R259 | Internal diseases in modern medicine | 1 731 | 1 512 | 219
R473 | Specialized nursing | 2 323 | 1 834 | 489
R511 | Viral infectious diseases | 4 813 | 3 529 | 1 284
R563 | Lung diseases | 1 952 | 1 668 | 284
Total | | 15 880 | 12 999 | 2 881
Distribution of Literature with Only One Classification Number
The Adjacency Matrix Constructed Based on the Similarity Between Words
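As a companion to the figure caption above, here is a hedged sketch of how a word-similarity adjacency matrix of this kind could be built from word vectors. The use of cosine similarity, the fixed threshold of 0.5, and the symmetric normalization are assumptions for illustration, not the exact settings reported in the paper.

```python
# Illustrative sketch: build a normalized adjacency matrix from pairwise
# word-vector similarity. Threshold and similarity measure are assumed.
import torch

def build_adjacency(word_vectors: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    # word_vectors: (num_words, dim) features for the words of one document
    normed = torch.nn.functional.normalize(word_vectors, dim=1)
    sim = normed @ normed.T                      # pairwise cosine similarity
    adj = (sim >= threshold).float()             # connect sufficiently similar words
    adj.fill_diagonal_(1.0)                      # add self-loops
    deg_inv_sqrt = adj.sum(dim=1).pow(-0.5)      # D^{-1/2}
    return deg_inv_sqrt[:, None] * adj * deg_inv_sqrt[None, :]  # symmetric normalization
```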
No. | Model | Description | Accuracy
1 | TextBERT-BiLSTM-Softmax | The document-level embedding from TextBERT is fed into the BiLSTM model and then into a Softmax layer to obtain the classification result | 0.833 04
2 | TextBERT-Softmax | The document-level embedding from TextBERT is fed directly into a Softmax layer to obtain the classification result | 0.829 57
3 | TextBERT-BiLSTM-SVM | SVM is a binary linear classifier that has performed well in many machine learning tasks since it was proposed; it scales well and, after continued development, also performs well on non-linear and multi-class classification, so it is chosen for comparison. The document-level embedding from TextBERT is fed into the BiLSTM model and then into the SVM to obtain the classification result | 0.829 29
4 | TextGCN | Uses one-hot features as input, builds a heterogeneous graph of documents and words [7], and performs semi-supervised text classification on the GCN | 0.728 90
5 | WordBERT-TextGCN | Initializes node representations with WordBERT embeddings, builds the same document-word heterogeneous graph as TextGCN, and performs semi-supervised text classification on the GCN | 0.729 23
6 | WordBERT-BiLSTM-TextGCN | Feeds the word vectors from WordBERT into the BiLSTM model, uses the resulting features as node representations, builds the same document-word heterogeneous graph as TextGCN, and performs semi-supervised text classification on the GCN | 0.734 06
Text Classification Results of Different Models
Classification Results of BERT-BiLSTM-SGCN Under Different Word Segmentation Conditions
Influence of Different Parameters on the Model Results
The Results of Different Categories of WordBERT-BiLSTM-SGCN with Optimal Parameters
[1] He Ming, Sun Jianjun, Cheng Ying. Text Classification Based on Naive Bayes: A Review[J]. Information Science, 2016, 34(7): 147-154. (in Chinese)
[2] Lei Fei. Research on Text Classification Based on Neural Network and Decision Tree and Its Application[D]. Chengdu: University of Electronic Science and Technology of China, 2018. (in Chinese)
[3] Wang Hao, Ye Peng, Deng Sanhong. The Application of Machine-Learning in the Research on Automatic Categorization of Chinese Periodical Articles[J]. New Technology of Library and Information Service, 2014(3): 80-87. (in Chinese)
[4] Wan Qibin, Dong Fangmin, Sun Shuifa. Text Classification Method Based on BiLSTM-Attention-CNN Hybrid Neural Network[J]. Computer Applications and Software, 2020, 37(9): 94-98, 201. (in Chinese)
[5] Shao Liangshan, Zhou Yu. Semantic Rules and RNN Based Sentiment Classification for Online Reviews[J]. Journal of Chinese Information Processing, 2019, 33(6): 124-131. (in Chinese)
[6] Devlin J, Chang M W, Lee K, et al. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding[OL]. arXiv Preprint, arXiv: 1810.04805.
[7] Yao L, Mao C S, Luo Y. Graph Convolutional Networks for Text Classification [C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2019: 7370-7377.
[8] Gao L C, Wang J K, Pi Z X, et al. A Hybrid GCN and RNN Structure Based on Attention Mechanism for Text Classification[J]. Journal of Physics: Conference Series, 2020, 1575:Article No. 012130.
[9] Fan Tao, Wu Peng, Cao Qi. The Research of Sentiment Recognition of Online Users Based on DNNs Multimodal Fusion[J]. Journal of Information Resources Management, 2020, 10(1): 39-48. (in Chinese)
[10] Du Ruopeng, Xian Guojian, Kou Yuantao. Improvement and Application of TF-IDF-CHI in Agricultural Science Text Feature Extraction[J]. Digital Library Forum, 2019(8): 18-24. (in Chinese)
[11] Jin Chunyan, Mu Dongmei, Wang Ping, et al. Research on Sentiment Analysis Method Integrating Emoticon Feature of Online Public Opinion[J]. Scientific Information Research, 2020, 2(4): 13-22. (in Chinese)
[12] Wang Hao, Deng Sanhong, Zhu Liping, et al. A Study of Intelligence Value and Employment of Political Data in Big Data Environment: The Risk Avoidance of Customs Declaration Commodities[J]. Scientific Information Research, 2020, 2(4): 74-89. (in Chinese)
[13] Zhang Chengzhi, Li Zhuo, Chu Heting. Using Full Content to Automatically Classify the Research Methods of Academic Articles[J]. Journal of the China Society for Scientific and Technical Information, 2020, 39(8): 852-862. (in Chinese)
[14] Lyu Lucheng, Han Tao, Zhou Jian, et al. Research on the Method of Chinese Patent Automatic Classification Based on Deep Learning[J]. Library and Information Service, 2020, 64(10): 75-85. (in Chinese)
[15] Shi Qin, Li Yang. Research on Text Resource Classification of Humanities and Social Sciences Thematic Database Based on Deep Learning: Taking “XinHua Silkroad” Database and “One Belt One Road” Database as Examples[J]. Journal of Information Resources Management, 2020, 10(5): 23-29, 37. (in Chinese)
[16] Wang Qian, Zeng Jin, Liu Jiawei, et al. Structure Function Recognition of Academic Text Paragraph Based on Deep Learning[J]. Information Science, 2020, 38(3): 64-69. (in Chinese)
[17] Xu Xukan, Zhou Zeyu. A Multi-scale BiLSTM-CNN Based Emotion Classification Model for WeChat Tweets and Its Application[J]. Information Science, 2021, 39(5): 130-137. (in Chinese)
[18] Wang Xiwei, Xing Yunfei, Wei Ya'nan, et al. Research on the Topic Model Construction of Sentiment Classification of Public Opinion Users in Social Networks Driven by Big Data: Taking “Immigration” as the Topic[J]. Journal of Information Resources Management, 2020, 10(1): 29-38, 48. (in Chinese)
[19] Xu Tongyang, Yin Kai. Text Classification of Digital Library Based on Deep Learning[J]. Information Science, 2019, 37(10): 13-19. (in Chinese)
[20] Yu S S, Su J D, Luo D. Improving BERT-Based Text Classification with Auxiliary Sentence and Domain Knowledge[J]. IEEE Access, 2019, 7:176600-176612.
[21] Lu Wei, Li Pengcheng, Zhang Guobiao, et al. Recognition of Lexical Functions in Academic Texts: Automatic Classification of Keywords Based on BERT Vectorization[J]. Journal of the China Society for Scientific and Technical Information, 2020, 39(12): 1320-1329. (in Chinese)
[22] Zhao Yang, Zhang Zhixiong, Liu Huan, et al. Classification of Chinese Medical Literature with BERT Model[J]. Data Analysis and Knowledge Discovery, 2020, 4(8): 41-49. (in Chinese)
[23] Tang H L, Mi Y, Xue F, et al. An Integration Model Based on Graph Convolutional Network for Text Classification[J]. IEEE Access, 2020, 8:148865-148876.
[24] Li G H, Müller M, Thabet A, et al. DeepGCNs: Can GCNs Go as Deep as CNNs? [C]//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. 2019: 9266-9275.
[25] Liu J X, Meng F R, Zhou Y, et al. Character-Level Neural Networks for Short Text Classification [C]//Proceedings of 2017 International Smart Cities Conference. IEEE, 2017. DOI: 10.1109/ISC2.2017.8090812.
[26] Zhang Xiaodan. The Application of Improved Graph Convolutional Neural Network in Big Data Classification of Scientific and Technological Documents[J]. Journal of Intelligence, 2021, 40(1): 184-188. (in Chinese)
[27] Guo Limin. Study of Automatic Classification of Literature Based on Convolution Neural Network[J]. Library & Information, 2017(6): 96-103. (in Chinese)
[28] Luo Pengcheng, Wang Yibo, Wang Jimin. Automatic Discipline Classification for Scientific Papers Based on a Deep Pre-Training Language Model[J]. Journal of the China Society for Scientific and Technical Information, 2020, 39(10): 1046-1059. (in Chinese)
[29] Khatri A, Pranav P, Anand K M. Sarcasm Detection in Tweets with BERT and GloVe Embeddings[OL]. arXiv Preprint, arXiv: 2006.11512.
[30] Sharfuddin A A, Tihami M N, Islam M S. A Deep Recurrent Neural Network with BiLSTM Model for Sentiment Classification [C]//Proceedings of 2018 International Conference on Bangla Speech and Language Processing. IEEE, 2018.
[31] Lu Z B, Du P, Nie J Y. VGCN-BERT: Augmenting BERT with Graph Embedding for Text Classification [C]//Proceedings of European Conference on Information Retrieval. Springer, Cham, 2020: 369-382.
[32] Chen H Y, Lin Y S, Lee C C. Through the Words of Viewers: Using Comment-Content Entangled Network for Humor Impression Recognition [C]//Proceedings of 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021.