Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (9): 60-67    DOI: 10.11925/infotech.2096-3467.2018.1423
Classifying Short-texts with Class Feature Extension
Yunfei Shao, Dongsu Liu
School of Economics and Management, Xidian University, Xi’an 710126, China
Abstract  

[Objective] This paper proposes a short-text classification method based on category feature extension, which addresses the sparse content of short texts. [Methods] We used an improved TF-IDF model and the LDA topic model to construct a keyword set and a topic distribution set, both based on category features. We then used these sets to expand the content and the vector representations of short texts. Finally, we classified the expanded short texts with a convolutional neural network. [Results] Compared with TextCNN using pre-trained word vectors, the proposed method improved precision by 3.0 percentage points and recall by 4.1 percentage points. [Limitations] We tested the new method only with a convolutional neural network. [Conclusions] The proposed method can improve the effectiveness of short-text classification.
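
The category-based keyword set described in the abstract can be illustrated with a minimal sketch. It assumes a plain class-level TF-IDF (each category's texts pooled into one bag of words, with IDF counted over categories); this is not the paper's improved TF-IDF weighting, whose details are not given here. The function name `category_keywords` and the toy tokenized texts are hypothetical.

```python
import math
from collections import Counter


def category_keywords(docs_by_category, top_k=2):
    """Rank words per category by a plain class-level TF-IDF score.

    Assumption: TF is the word's share of all tokens in the category,
    and IDF is log(number of categories / categories containing the word).
    """
    pooled = {
        cat: Counter(word for doc in docs for word in doc)
        for cat, docs in docs_by_category.items()
    }
    n_cats = len(pooled)
    # For each word, count how many categories contain it at least once.
    cats_containing = Counter(w for counts in pooled.values() for w in counts)

    keywords = {}
    for cat, counts in pooled.items():
        total = sum(counts.values())
        scores = {
            w: (c / total) * math.log(n_cats / cats_containing[w])
            for w, c in counts.items()
        }
        # Sort by descending score, then alphabetically for stable ties.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords[cat] = [w for w in ranked[:top_k] if scores[w] > 0]
    return keywords


# Hypothetical, already tokenized short texts.
toy_corpus = {
    "economy": [["stock", "market", "listing"], ["stock", "price", "news"]],
    "games": [["game", "rank", "news"], ["game", "action"]],
}
keyword_sets = category_keywords(toy_corpus)
print(keyword_sets)
```

Words that occur in every category (here "news") score zero and are excluded, which is the property that makes the remaining words usable as category features.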

Key words: Word Vector; LDA Model; CNN; Short-Text Classification
Received: 18 December 2018      Published: 23 October 2019
CLC number: G35

Cite this article:

Yunfei Shao, Dongsu Liu. Classifying Short-texts with Class Feature Extension. Data Analysis and Knowledge Discovery, 2019, 3(9): 60-67.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.1423     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I9/60

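Topic      Topic words and probabilities
Topic1     listing (上市): 0.064    stock (股票): 0.042    technology (科技): 0.039    retail investor (散户): 0.021
Topic2     rank tier (段位): 0.045    War God (战神): 0.042    action (动作): 0.022    Onmyoji (阴阳师): 0.076
···        ···
Topic80    primary school student (小学生): 0.032    teacher (老师): 0.071    development (发展): 0.069    exam application (报考): 0.043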
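
As a rough illustration of how topic words like those in the table above could extend a short text, the sketch below picks the most probable words of a topic and appends the ones the text lacks. The probabilities are copied from the table; the helper names, the cut-off `k`, the sample text, and the assumption that the text's topic is already known are all hypothetical and not taken from the paper.

```python
# Topic-word probabilities copied from the table above.
topics = {
    "Topic1": {"上市": 0.064, "股票": 0.042, "科技": 0.039, "散户": 0.021},
    "Topic2": {"段位": 0.045, "战神": 0.042, "动作": 0.022, "阴阳师": 0.076},
    "Topic80": {"小学生": 0.032, "老师": 0.071, "发展": 0.069, "报考": 0.043},
}


def top_topic_words(word_probs, k=2):
    """Return the k most probable (word, probability) pairs of a topic."""
    return sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]


def expand_text(tokens, word_probs, k=2):
    """Append the top-k topic words that the short text does not contain."""
    extra = [w for w, _ in top_topic_words(word_probs, k) if w not in tokens]
    return tokens + extra


# Hypothetical tokenized short text, assumed to belong to Topic2.
short_text = ["阴阳师", "更新"]
print(top_topic_words(topics["Topic2"]))
print(expand_text(short_text, topics["Topic2"]))
```

Sorting by probability matters because the table does not list words in descending order: in Topic2 the most probable word (阴阳师, 0.076) appears last.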
Text category    Training set    Test set
Sports           3,150           1,350
Education        2,660           1,140
Economy          2,100           900
Games            2,800           1,200
Real estate      2,590           1,110
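
The split proportions implied by the dataset table can be checked with a few lines of arithmetic. The counts are copied from the table; the variable names are illustrative only.

```python
# (training texts, test texts) per category, copied from the table above.
splits = {
    "Sports": (3150, 1350),
    "Education": (2660, 1140),
    "Economy": (2100, 900),
    "Games": (2800, 1200),
    "Real estate": (2590, 1110),
}

# Share of each category's texts that falls in the training set.
train_share = {
    name: train / (train + test) for name, (train, test) in splits.items()
}
total_train = sum(train for train, _ in splits.values())
total_test = sum(test for _, test in splits.values())

for name, share in train_share.items():
    print(f"{name}: {share:.0%} training")
print(total_train, total_test)
```

Every category works out to a 70/30 split, with 13,300 training texts and 5,700 test texts in total.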
Parameter                     Setting
Pooling method                1-max pooling
Convolution kernel sizes      2, 3, 4
Number of kernels             100
Batch_size                    128
Dropout_prob                  0.5
Training epochs               10
Optimizer                     AdamOptimizer
Learning rate                 0.01
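
To make the convolution and pooling settings above concrete, here is a minimal pure-Python sketch of the forward pass they imply: filters of height 2, 3 and 4 slide over the token embeddings, and 1-max pooling keeps one value per filter. The random filters, the toy embedding dimension, and the function names are assumptions for illustration; a real model would learn the filters and would typically be built in a deep learning framework.

```python
import random

KERNEL_SIZES = (2, 3, 4)  # kernel sizes from the parameter table
NUM_FILTERS = 100         # filters per kernel size, from the table


def conv_1max(embeddings, kernel, bias=0.0):
    """Apply one filter to every window, then ReLU and 1-max pooling."""
    h = len(kernel)  # number of consecutive tokens the filter covers
    best = 0.0       # starting at 0 folds the ReLU into the max
    for start in range(len(embeddings) - h + 1):
        window = embeddings[start:start + h]
        activation = bias + sum(
            w * x
            for kernel_row, token_vec in zip(kernel, window)
            for w, x in zip(kernel_row, token_vec)
        )
        best = max(best, activation)
    return best


def random_filters(embed_dim, rng):
    """Hypothetical random filters; a trained model would learn these."""
    return {
        h: [
            [[rng.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(h)]
            for _ in range(NUM_FILTERS)
        ]
        for h in KERNEL_SIZES
    }


def sentence_features(embeddings, filters):
    """Concatenate the pooled value of every filter of every kernel size."""
    return [conv_1max(embeddings, k) for h in KERNEL_SIZES for k in filters[h]]


rng = random.Random(0)
embed_dim = 8  # assumed toy dimension, not taken from the source
sentence = [[rng.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(6)]
features = sentence_features(sentence, random_filters(embed_dim, rng))
print(len(features))  # 3 kernel sizes x 100 filters = 300
```

Because pooling keeps a single value per filter, the feature vector has a fixed length of 300 regardless of how many tokens the short text contains.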
                          Classified as A    Classified as not A
Actually class A          TP                 FN
Actually not class A      FP                 TN
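
The confusion matrix above feeds the usual evaluation metrics. The sketch below uses the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F1 as their harmonic mean), which are assumed to match the paper's; the counts in the example are made up.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from its counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for class A: 90 true positives,
# 10 false positives, 30 false negatives.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

TN does not appear in any of the three formulas, so these metrics are unaffected by how many texts are correctly rejected for the class.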
Classification method                        Precision (mean)    Recall (mean)    F1
VSM+KNN                                      67.3%               65.0%            66.1%
LDA+KNN                                      60.1%               55.4%            57.7%
TextCNN + pre-trained word vectors           88.3%               87.5%            87.9%
TextCNN + extended vectors (this paper)      91.3%               91.6%            91.4%
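
As a quick sanity check on the results table, the reported F1 values can be recomputed from the mean precision and recall, and the gains over the TextCNN baseline can be derived directly. The figures are copied from the table; treating F1 as the harmonic mean of the two reported means is an assumption about how the paper computed it.

```python
# (precision %, recall %, reported F1 %) copied from the results table.
results = {
    "VSM+KNN": (67.3, 65.0, 66.1),
    "LDA+KNN": (60.1, 55.4, 57.7),
    "TextCNN + pre-trained word vectors": (88.3, 87.5, 87.9),
    "TextCNN + extended vectors": (91.3, 91.6, 91.4),
}


def harmonic_f1(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {
    name: round(harmonic_f1(p, r), 1) for name, (p, r, _) in results.items()
}

base_p, base_r, _ = results["TextCNN + pre-trained word vectors"]
ext_p, ext_r, _ = results["TextCNN + extended vectors"]
precision_gain = round(ext_p - base_p, 1)
recall_gain = round(ext_r - base_r, 1)

for name, (_, _, reported) in results.items():
    print(name, recomputed[name], reported)
print(precision_gain, recall_gain)
```

The recomputed F1 values agree with the reported ones to one decimal place, and the gains of 3.0 and 4.1 percentage points match the figures in the abstract.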