Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (12): 68-76    DOI: 10.11925/infotech.2096-3467.2018.0391
Classifying Chinese Texts with CapsNet
Feng Guoming, Zhang Xiaodong, Liu Suhui
School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China
Abstract  

[Objective] This study addresses the difficulty of representing long texts and uses CapsNet to improve the accuracy of Chinese text classification. [Methods] First, we proposed representing long texts with an LDA matrix and word vectors. Second, we constructed a Chinese text classification model based on CapsNet. Third, we evaluated the model on the Sogou news corpus and the Fudan University text classification corpus. Finally, we compared the results with those of classic models such as TextCNN and DNN. [Results] The CapsNet model outperformed the other models. On five-category classification, accuracy reached 89.6% for short texts and 96.9% for long texts. The proposed model converged almost twice as fast as the CNN model. [Limitations] The model's high computational complexity limited the size of the test corpus. [Conclusions] The proposed Chinese text representation method and the modified CapsNet model achieve better accuracy, convergence speed and robustness than the existing methods.
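
The abstract does not describe how the modified CapsNet model is built, so the following is only a minimal sketch of the generic routing-by-agreement step published by Sabour, Frosst and Hinton, not the authors' implementation. The function names (`squash`, `dynamic_routing`), the toy sizes (6 input capsules, 5 class capsules, 8 dimensions) and the random prediction vectors are assumptions chosen for illustration.

```python
import math
import random


def squash(s):
    """Squash nonlinearity: keeps the direction of s, maps its length into [0, 1)."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0] * len(s)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, num_iters=3):
    """Route prediction vectors to output capsules.

    u_hat[i][j] is the vector that input capsule i predicts for output capsule j.
    Returns one output vector per output capsule.
    """
    num_in, num_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * num_out for _ in range(num_in)]  # routing logits
    v = []
    for _ in range(num_iters):
        # Coupling coefficients: each input capsule distributes itself over outputs.
        c = [softmax(row) for row in b]
        v = []
        for j in range(num_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(num_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Raise the logit wherever a prediction agrees with the output capsule.
        for i in range(num_in):
            for j in range(num_out):
                b[i][j] += sum(a * w for a, w in zip(u_hat[i][j], v[j]))
    return v


random.seed(0)
# Hypothetical input: 6 input capsules, 5 class capsules, 8-dimensional outputs.
u_hat = [[[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]
         for _ in range(6)]
v = dynamic_routing(u_hat)
lengths = [math.sqrt(sum(x * x for x in vec)) for vec in v]
print(lengths)  # one length per class capsule, each below 1
```

In a capsule classifier of this kind, the length of each output vector is read as the score of the corresponding class, which is why the squash step bounds it below 1.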

Key words: Text Categorization; CapsNet; Deep Learning; Text Representation; TextCNN
Received: 08 April 2018      Published: 16 January 2019
CLC Number (ZTFLH): G350

Cite this article:

Feng Guoming, Zhang Xiaodong, Liu Suhui. Classifying Chinese Texts with CapsNet. Data Analysis and Knowledge Discovery, 2018, 2(12): 68-76.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.0391     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2018/V2/I12/68

Dataset (number of items)

Category        Training set   Test set   Total items
Sports          8,000          2,000      10,000
Entertainment   8,000          2,000      10,000
Education       8,000          2,000      10,000
Finance         8,000          2,000      10,000
Technology      8,000          2,000      10,000
Total items     40,000         10,000     50,000
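
Dataset (documents per category and word counts)

Category          Training set     Test set        Total words
C19-Computer      850 documents    150 documents   2.76 million
C32-Agriculture   850 documents    150 documents   2.93 million
C34-Economy       850 documents    150 documents   2.82 million
C38-Politics      850 documents    150 documents   2.13 million
C39-Sports        850 documents    150 documents   2.55 million
Total words       11.30 million    1.89 million    13.19 million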
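
Task                        No.   Model                 Text representation   Classifier
Short-text classification   1     LDA+KNN               LDA                   KNN
Short-text classification   2     LDA+DNN               LDA                   DNN
Short-text classification   3     W2V_matrix+TextCNN    W2V_matrix            CNN
Short-text classification   4     W2V_matrix+CapsNet    W2V_matrix            CapsNet
Long-text classification    5     LDA+DNN               LDA                   DNN
Long-text classification    6     LDA_matrix+DNN        LDA_matrix            DNN
Long-text classification    7     LDA_matrix+CapsNet    LDA_matrix            CapsNet
Long-text classification    8     W2V_cuboid+CNN        W2V_cuboid            CNN
Long-text classification    9     W2V_cuboid+CapsNet    W2V_cuboid            CapsNet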

No.   Method                P       R       F
1     LDA+KNN               0.681   0.712   0.696
2     LDA+DNN               0.749   0.793   0.770
3     W2V_matrix+TextCNN    0.838   0.866   0.852
4     W2V_matrix+CapsNet    0.896   0.901   0.898

No.   Method                P       R       F
5     LDA+DNN               0.647   0.662   0.654
6     LDA_matrix+DNN        0.784   0.807   0.795
7     LDA_matrix+CapsNet    0.926   0.933   0.929
8     W2V_cuboid+CNN        0.895   0.913   0.904
9     W2V_cuboid+CapsNet    0.969   0.972   0.970
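
The page does not define the P, R and F columns. Assuming they are precision, recall and the balanced F-measure (the harmonic mean of P and R), the short check below recomputes F from the reported P and R values. The function name `f_measure` and the dictionary layout are illustrative; the numbers are copied from the two results tables.

```python
def f_measure(p, r):
    """Balanced F-measure: harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# (P, R, reported F) for models 1-9, as listed in the results tables.
reported = {
    "LDA+KNN (short)": (0.681, 0.712, 0.696),
    "LDA+DNN (short)": (0.749, 0.793, 0.770),
    "W2V_matrix+TextCNN": (0.838, 0.866, 0.852),
    "W2V_matrix+CapsNet": (0.896, 0.901, 0.898),
    "LDA+DNN (long)": (0.647, 0.662, 0.654),
    "LDA_matrix+DNN": (0.784, 0.807, 0.795),
    "LDA_matrix+CapsNet": (0.926, 0.933, 0.929),
    "W2V_cuboid+CNN": (0.895, 0.913, 0.904),
    "W2V_cuboid+CapsNet": (0.969, 0.972, 0.970),
}

for name, (p, r, f_reported) in reported.items():
    f = f_measure(p, r)
    print(f"{name}: computed F = {f:.3f}, reported F = {f_reported:.3f}")
```

Under that assumption the recomputed values agree with the reported F column to three decimal places, which suggests the tables use the standard balanced F-measure.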
  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938   E-mail:jishu@mail.las.ac.cn