Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (10): 89-97    DOI: 10.11925/infotech.2096-3467.2019.0081
Classifying Texts with KACC Model
Yuman Li, Zhibo Chen, Fu Xu
School of Information Science & Technology, Beijing Forestry University, Beijing 100083, China
Abstract  

[Objective] This paper aims to improve the quality of text representation and to correlate text contents with label vectors, so as to improve classification results. [Methods] First, we modified the keyword extraction method (KE), used the resulting keyword vectors to represent the texts, and adopted a category label representation algorithm (CLR) to create the text vectors. Then, we employed an attention-based capsule network (Attention-Capsnet) as the classifier to construct the KACC (KE-Attention-Capsnet-CLR) model. Finally, we compared its classification results with those of other methods. [Results] The KACC model effectively improved data quality, which led to better Precision, Recall and F-measure than existing models; its classification precision reached 97.4%. [Limitations] The size of the experimental data set needs to be expanded, and further research is needed to examine the category discrimination rules on other corpora. [Conclusions] The KACC model is an effective model for text classification.
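To make the pipeline described above concrete, the sketch below illustrates the KACC data flow in plain Python. It is only a minimal sketch under stated assumptions, not the authors' implementation: TF-IDF-style scoring stands in for the modified keyword extraction (KE) step, averaged word embeddings stand in for the learned text representation, and cosine-similarity matching against category-label vectors stands in for the Attention-Capsnet classifier and the CLR step. All names, vectors and IDF values are hypothetical.

# Illustrative sketch of the KACC data flow (NOT the paper's implementation).
# Assumptions: TF-IDF-style scoring replaces the modified KE step, mean word
# embeddings replace the learned representation, and cosine similarity to
# category-label vectors replaces Attention-Capsnet + CLR.
import numpy as np

def extract_keywords(tokens, idf, top_k=5):
    """Simplified KE step: rank tokens by TF-IDF and keep the top_k as keywords."""
    tf = {t: tokens.count(t) / len(tokens) for t in set(tokens)}
    scores = {t: tf[t] * idf.get(t, 1.0) for t in tf}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

def text_vector(keywords, embeddings, dim=50):
    """Represent a text by the mean of its keyword embeddings."""
    vecs = [embeddings[w] for w in keywords if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def classify(text_vec, label_vecs):
    """Pick the category whose label vector is most similar to the text vector."""
    def cos(a, b):
        na, nb = np.linalg.norm(a), np.linalg.norm(b)
        return float(a @ b) / (na * nb) if na and nb else 0.0
    return max(label_vecs, key=lambda c: cos(text_vec, label_vecs[c]))

# Toy usage with random embeddings and two hypothetical categories.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["tank", "army", "novel", "poem", "war"]}
labels = {"Military": emb["army"] + emb["war"], "Literature": emb["novel"] + emb["poem"]}
doc = ["the", "army", "moved", "the", "tank", "into", "the", "war", "zone"]
kw = extract_keywords(doc, idf={"army": 2.0, "tank": 2.0, "war": 2.0, "the": 0.1}, top_k=3)
print(kw, classify(text_vector(kw, emb), labels))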

Key words: Text Classification; Keyword Extraction; Attention Mechanism; Capsule Network; Category Label Representation
Received: 18 January 2019      Published: 25 November 2019
CLC Number: TP391
Corresponding Author: Zhibo Chen     E-mail: zhibo@bjfu.edu.cn

Cite this article:

Yuman Li, Zhibo Chen, Fu Xu. Classifying Texts with KACC Model. Data Analysis and Knowledge Discovery, 2019, 3(10): 89-97.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.0081     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I10/89

Evaluation metrics by model

Model No.   MRR      Bpref    P        R        F
1           0.6429   0.7739   0.7491   0.6057   0.6698
2           0.6330   0.7635   0.7226   0.5903   0.6498
3           0.5251   0.7394   0.6240   0.4934   0.5511
4           0.6070   0.7306   0.6741   0.5157   0.5844
5           0.5995   0.7177   0.6276   0.4852   0.5473
6           0.6861   0.7820   0.7776   0.5914   0.6718
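As a reading aid, the metric columns above can be interpreted with their standard definitions, which are assumed here rather than taken from the paper: MRR is the mean reciprocal rank over queries, and F is the harmonic mean of precision P and recall R.

\[ \mathrm{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{\mathrm{rank}_i}, \qquad F = \frac{2PR}{P+R} \]

For example, the first row gives F = (2 × 0.7491 × 0.6057) / (0.7491 + 0.6057) ≈ 0.6698, matching the listed value.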
Dataset composition by category

Category            Training set (docs)   Test set (docs)   Total words (×10⁴)
Culture             800                   200               203.6
Entertainment       800                   200               59.6
History             800                   200               363.3
Military            800                   200               135.4
Literature          800                   200               82.2
Total words (×10⁴)  677.3                 166.8             844.1
No.   Model                          P       R       F1
1     FT+CNN+OneHot                  0.891   0.885   0.888
2     KE+CNN+OneHot                  0.839   0.870   0.854
3     KE+Attention+CNN+OneHot        0.888   0.860   0.874
4     KE+Attention+CNN+CLR           0.901   0.895   0.898
5     KE+Capsnet+OneHot              0.889   0.900   0.894
6     KE+Attention+Capsnet+OneHot    0.954   0.925   0.939
7     KE+Attention+Capsnet+CLR       0.974   0.970   0.972
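Assuming the F1 column is the harmonic mean of the listed P and R values, each row of the comparison table above can be reproduced to within rounding; the short check below (values transcribed from the table) illustrates this.

# Consistency check: F1 = 2PR/(P+R), values transcribed from the table above.
rows = [
    (0.891, 0.885, 0.888), (0.839, 0.870, 0.854), (0.888, 0.860, 0.874),
    (0.901, 0.895, 0.898), (0.889, 0.900, 0.894), (0.954, 0.925, 0.939),
    (0.974, 0.970, 0.972),
]
for p, r, f1 in rows:
    assert abs(2 * p * r / (p + r) - f1) < 1e-3, (p, r, f1)
print("All F1 values match the harmonic mean of P and R to rounding.")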
Results by model and category

Model            Literature   Culture   History   Military   Entertainment
1                0.815        0.840     0.885     0.930      0.985
2                0.770        0.765     0.855     0.870      0.935
3                0.830        0.845     0.890     0.910      0.965
4                0.875        0.880     0.905     0.915      0.930
5                0.800        0.805     0.945     0.910      0.985
6                0.925        0.920     0.955     0.970      1.000
7                0.945        0.950     0.975     1.000      1.000
Avg              0.852        0.858     0.916     0.929      0.971
Discrimination   0.876        0.882     0.943     0.969      0.998