Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (10): 89-97    DOI: 10.11925/infotech.2096-3467.2019.0081
Classifying Texts with KACC Model
Yuman Li, Zhibo Chen, Fu Xu
School of Information Science & Technology, Beijing Forestry University, Beijing 100083, China
Abstract  

[Objective] This paper aims to improve the quality of text representation and to associate text content with text label vectors, thereby improving classification results. [Methods] First, we modified the keyword extraction (KE) method and used the resulting keyword vectors to represent each text. We then adopted a category label representation (CLR) algorithm to create the text label vectors. Next, we employed an attention-based capsule network (Attention-Capsnet) as the classifier, which together with KE and CLR forms the KACC (KE-Attention-Capsnet-CLR) model. Finally, we compared our classification results with those of other methods. [Results] The KACC model effectively improved data quality and achieved better precision, recall and F-measure than the existing models it was compared with. Its classification precision reached 97.4%. [Limitations] The experimental dataset needs to be expanded, and the category discrimination rules need to be examined on other corpora. [Conclusions] The KACC model is an effective text classification model.
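
The abstract names a capsule network as the classifier but gives no architectural detail. As a rough illustration only, the sketch below shows the squash nonlinearity from the original capsule network formulation (Sabour, Frosst and Hinton, "Dynamic Routing Between Capsules", 2017), which rescales a capsule's output vector so that its length can be read as a probability. Whether KACC uses this function unchanged is an assumption; the function names `squash` and `length` are illustrative.

```python
import math


def squash(s):
    """Rescale vector s so its length lies in [0, 1) while keeping its direction.

    Standard capsule squash: v = (|s|^2 / (1 + |s|^2)) * (s / |s|).
    Assumption: KACC's capsule layer may differ; this is the textbook form.
    """
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        # A zero vector has no direction; return zeros to avoid dividing by 0.
        return [0.0 for _ in s]
    norm = math.sqrt(norm_sq)
    scale = norm_sq / (1.0 + norm_sq) / norm
    return [scale * x for x in s]


def length(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


if __name__ == "__main__":
    v = squash([3.0, 4.0])  # input length 5, so output length is 25/26
    print(v, length(v))
```

Long input vectors are mapped to lengths close to 1 and short ones to lengths close to 0, so the length of each class capsule can be compared directly across categories.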

Key words: Text Classification; Keywords Extraction; Attention Mechanism; Capsule Network; Category Label Representation
Received: 18 January 2019      Published: 25 November 2019
CLC Number: TP391
Corresponding Authors: Zhibo Chen     E-mail: zhibo@bjfu.edu.cn

Cite this article:

Yuman Li, Zhibo Chen, Fu Xu. Classifying Texts with KACC Model. Data Analysis and Knowledge Discovery, 2019, 3(10): 89-97.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.0081     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I10/89

Evaluation metrics by model number

Model No.   MRR      Bpref    P        R        F
1           0.6429   0.7739   0.7491   0.6057   0.6698
2           0.6330   0.7635   0.7226   0.5903   0.6498
3           0.5251   0.7394   0.6240   0.4934   0.5511
4           0.6070   0.7306   0.6741   0.5157   0.5844
5           0.5995   0.7177   0.6276   0.4852   0.5473
6           0.6861   0.7820   0.7776   0.5914   0.6718
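
The page does not define the metrics in this table. As a hedged illustration, the sketch below computes MRR (mean reciprocal rank) under its usual definition: for each document, take the reciprocal of the rank of the first correct item in the ranked output, then average over documents. The function name and the toy keyword lists are hypothetical and are not data from the paper.

```python
def mean_reciprocal_rank(ranked_lists, gold_sets):
    """Average of 1/rank of the first correct item per document.

    ranked_lists: one ranked list of extracted items per document.
    gold_sets:    one set of correct items per document.
    A document with no correct item in its list contributes 0.
    """
    if len(ranked_lists) != len(gold_sets):
        raise ValueError("need one gold set per ranked list")
    if not ranked_lists:
        raise ValueError("need at least one document")
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


if __name__ == "__main__":
    # Hypothetical toy data: two documents with ranked keyword candidates.
    ranked = [["army", "river", "treaty"], ["poem", "author", "novel"]]
    gold = [{"river"}, {"poem", "novel"}]
    print(mean_reciprocal_rank(ranked, gold))  # (1/2 + 1/1) / 2
```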
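Dataset by category

Category                 Training set (docs)   Test set (docs)   Total words (millions)
Culture                  800                   200               2.036
Entertainment            800                   200               0.596
History                  800                   200               3.633
Military                 800                   200               1.354
Literature               800                   200               0.822
Total words (millions)   6.773                 1.668             8.441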
No.   Model                          P       R       F1
1     FT+CNN+OneHot                  0.891   0.885   0.888
2     KE+CNN+OneHot                  0.839   0.870   0.854
3     KE+Attention+CNN+OneHot        0.888   0.860   0.874
4     KE+Attention+CNN+CLR           0.901   0.895   0.898
5     KE+Capsnet+OneHot              0.889   0.900   0.894
6     KE+Attention+Capsnet+OneHot    0.954   0.925   0.939
7     KE+Attention+Capsnet+CLR       0.974   0.970   0.972
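
The F1 column above is consistent with the usual harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the P and R values in the table; the variable and function names are illustrative, and the formula is the standard definition rather than one stated on this page.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (model number, P, R, reported F1) copied from the table above.
REPORTED = [
    (1, 0.891, 0.885, 0.888),
    (2, 0.839, 0.870, 0.854),
    (3, 0.888, 0.860, 0.874),
    (4, 0.901, 0.895, 0.898),
    (5, 0.889, 0.900, 0.894),
    (6, 0.954, 0.925, 0.939),
    (7, 0.974, 0.970, 0.972),
]

if __name__ == "__main__":
    for number, p, r, reported in REPORTED:
        recomputed = round(f1_score(p, r), 3)
        print(number, recomputed, reported)
```

Rounded to three decimals, the recomputed values agree with the reported F1 for all seven models.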
Results by category and model

Model            Literature   Culture   History   Military   Entertainment
1                0.815        0.840     0.885     0.930      0.985
2                0.770        0.765     0.855     0.870      0.935
3                0.830        0.845     0.890     0.910      0.965
4                0.875        0.880     0.905     0.915      0.930
5                0.800        0.805     0.945     0.910      0.985
6                0.925        0.920     0.955     0.970      1.000
7                0.945        0.950     0.975     1.000      1.000
Avg              0.852        0.858     0.916     0.929      0.971
Discrimination   0.876        0.882     0.943     0.969      0.998
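
The page does not say which metric the per-category values report. One observation that can be checked directly: averaging each model's five category values (a macro-average, weighting every category equally) reproduces the P column of the model comparison table. The sketch below performs that check; the dictionary names are illustrative, and reading the values as per-category precision is an inference from the numbers, not a statement from the source.

```python
# Per-category values for each model, in the column order of the table above
# (Literature, Culture, History, Military, Entertainment).
PER_CATEGORY = {
    1: [0.815, 0.840, 0.885, 0.930, 0.985],
    2: [0.770, 0.765, 0.855, 0.870, 0.935],
    3: [0.830, 0.845, 0.890, 0.910, 0.965],
    4: [0.875, 0.880, 0.905, 0.915, 0.930],
    5: [0.800, 0.805, 0.945, 0.910, 0.985],
    6: [0.925, 0.920, 0.955, 0.970, 1.000],
    7: [0.945, 0.950, 0.975, 1.000, 1.000],
}

# P column from the model comparison table.
REPORTED_P = {1: 0.891, 2: 0.839, 3: 0.888, 4: 0.901, 5: 0.889, 6: 0.954, 7: 0.974}


def macro_average(values):
    """Unweighted mean over categories."""
    return sum(values) / len(values)


if __name__ == "__main__":
    for model, values in PER_CATEGORY.items():
        print(model, round(macro_average(values), 3), REPORTED_P[model])
```

Because every category has the same number of test documents (200), the macro-average here equals the average weighted by test-set size.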