Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (9): 60-67    DOI: 10.11925/infotech.2096-3467.2018.1423
Classifying Short-texts with Class Feature Extension
Yunfei Shao, Dongsu Liu
School of Economics and Management, Xidian University, Xi’an 710126, China
Abstract  

[Objective] This paper proposes a short-text classification method based on category feature extension to address the sparse content of short texts. [Methods] We used an improved TF-IDF model and the LDA topic model to construct a keyword set and a topic distribution set, both based on category features. We then used these sets to extend the content and the vector representations of short texts. Finally, we classified the extended short texts with a convolutional neural network. [Results] The proposed method improved classification precision by 3.0 percentage points and recall by 4.1 percentage points. [Limitations] We evaluated the new method only with a convolutional neural network. [Conclusions] The proposed method can improve the effectiveness of short-text classification.
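
The keyword-set step in the abstract can be illustrated with a minimal sketch. It uses plain TF-IDF computed at the category level, not the improved weighting the paper proposes, which this page does not specify. The function name `category_keywords`, the toy English tokens, and the smoothed IDF formula are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def category_keywords(docs, labels, top_k=2):
    """Return the top_k highest-scoring words for each category.

    Assumption: plain category-level TF-IDF, not the paper's improved variant.
    """
    # Pool every document of a category into one bag of words.
    pooled = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        pooled[label].update(tokens)

    n_cats = len(pooled)
    # Number of categories in which each word occurs at least once.
    cat_freq = Counter(w for counts in pooled.values() for w in counts)

    keywords = {}
    for label, counts in pooled.items():
        total = sum(counts.values())
        # TF within the category times a smoothed IDF over categories;
        # a word that occurs in every category scores zero.
        scores = {
            w: (c / total) * math.log((1 + n_cats) / (1 + cat_freq[w]))
            for w, c in counts.items()
        }
        # Sort by descending score, then alphabetically so ties are stable.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords[label] = ranked[:top_k]
    return keywords


# Hypothetical toy corpus of pre-tokenized short texts.
docs = [
    ["stock", "market", "rise"],
    ["stock", "fund", "news"],
    ["game", "rank", "news"],
    ["game", "player", "rank"],
]
labels = ["economy", "economy", "games", "games"]

kw = category_keywords(docs, labels)
print(kw)
```

Because "news" occurs in both toy categories, its score is zero and it never enters a keyword set, which is the behavior one would want from a category-discriminative weighting.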

Key words: Word Vector; LDA Model; CNN; Short-Text Classification
Received: 18 December 2018      Published: 23 October 2019
Chinese Library Classification (ZTFLH): G35

Cite this article:

Yunfei Shao, Dongsu Liu. Classifying Short-texts with Class Feature Extension. Data Analysis and Knowledge Discovery, 2019, 3(9): 60-67.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.1423     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I9/60

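Topic      Topic words and probabilities
Topic1     listing (上市): 0.064    stock (股票): 0.042    technology (科技): 0.039    retail investor (散户): 0.021
Topic2     rank tier (段位): 0.045    war god (战神): 0.042    action (动作): 0.022    Onmyoji (阴阳师): 0.076
···        ···    ···    ···    ···
Topic80    primary school student (小学生): 0.032    teacher (老师): 0.071    development (发展): 0.069    exam application (报考): 0.043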
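
One way to picture how a topic-word table like the one above could extend a short text is sketched below. The matching rule (extend a text with the most probable words of any topic that shares a token with it) and the names `expand` and `top_n` are assumptions for illustration; the page does not state the exact extension rule the authors used. The probabilities are copied from the three rows shown in the table.

```python
# Topic-word probabilities copied from the three visible rows of the table.
topics = {
    "Topic1": {"上市": 0.064, "股票": 0.042, "科技": 0.039, "散户": 0.021},
    "Topic2": {"段位": 0.045, "战神": 0.042, "动作": 0.022, "阴阳师": 0.076},
    "Topic80": {"小学生": 0.032, "老师": 0.071, "发展": 0.069, "报考": 0.043},
}


def expand(tokens, topics, top_n=2):
    """Append up to top_n new words from every topic that shares a token with the text.

    Assumption: a topic is considered relevant if any token of the text
    appears among its listed words.
    """
    extended = list(tokens)
    for words in topics.values():
        if any(t in words for t in tokens):
            # Rank the topic's words from most to least probable.
            ranked = sorted(words, key=words.get, reverse=True)
            new_words = [w for w in ranked if w not in extended][:top_n]
            extended.extend(new_words)
    return extended


# "股票" (stock) matches Topic1, so its two most probable unseen words are added.
short_text = ["股票", "大跌"]
extended_text = expand(short_text, topics)
print(extended_text)
```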
Text category    Training set    Test set
Sports           3,150           1,350
Education        2,660           1,140
Economy          2,100           900
Games            2,800           1,200
Real estate      2,590           1,110
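
A quick arithmetic check on the counts above shows how the data were divided. The sketch below only recomputes totals and ratios from the table; the English category names are translations and the variable names are illustrative.

```python
# (training, test) counts copied from the table above.
splits = {
    "Sports": (3150, 1350),
    "Education": (2660, 1140),
    "Economy": (2100, 900),
    "Games": (2800, 1200),
    "Real estate": (2590, 1110),
}

train_total = sum(train for train, _ in splits.values())
test_total = sum(test for _, test in splits.values())

for name, (train, test) in splits.items():
    share = train / (train + test)
    print(f"{name}: {train + test} texts, training share {share:.2f}")

print("all categories:", train_total, "training,", test_total, "test")
```

Every category works out to a 70/30 split between training and test data.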
Parameter            Setting          Parameter          Setting
Pooling method       1-max pooling    Dropout_prob       0.5
Kernel sizes         2, 3, 4          Training epochs    10
Number of kernels    100              Optimizer          AdamOptimizer
Batch_size           128              Learning rate      0.01
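
To make the settings above concrete, the following sketch works out what they imply for the size of the network's pooled feature vector. It assumes the usual TextCNN layout: stride-1 convolutions without padding, 100 kernels for each of the three kernel sizes, and one value kept per kernel by 1-max pooling. The sentence length of 20 tokens and all function names are hypothetical.

```python
import math

# Settings copied from the parameter table; key names are illustrative.
config = {
    "pooling": "1-max",
    "kernel_sizes": (2, 3, 4),
    "num_kernels": 100,  # assumed to mean 100 kernels per kernel size
    "batch_size": 128,
    "dropout_prob": 0.5,
    "epochs": 10,
    "optimizer": "Adam",
    "learning_rate": 0.01,
}


def conv_output_length(seq_len, kernel_size):
    """Positions a kernel visits, assuming stride 1 and no padding."""
    return seq_len - kernel_size + 1


def one_max_pool(feature_map):
    """1-max pooling keeps only the largest activation of a feature map."""
    return max(feature_map)


def pooled_feature_size(cfg):
    """One pooled value per kernel, concatenated over all kernel sizes."""
    return len(cfg["kernel_sizes"]) * cfg["num_kernels"]


seq_len = 20  # hypothetical sentence length in tokens
lengths = [conv_output_length(seq_len, k) for k in config["kernel_sizes"]]
print("feature-map lengths:", lengths)
print("pooled example:", one_max_pool([0.1, 0.7, 0.3]))
print("pooled feature size:", pooled_feature_size(config))

# With 13,300 training texts in total, one epoch needs this many batches.
batches_per_epoch = math.ceil(13300 / config["batch_size"])
print("batches per epoch:", batches_per_epoch)
```

Under these assumptions the classifier layer sees a 300-dimensional vector regardless of sentence length, because pooling collapses each feature map to a single value.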
                Classified as A    Classified as not A
Actually A      TP                 FN
Actually not A  FP                 TN
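
The four cells of the confusion matrix above lead to the standard definitions precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as their harmonic mean. The sketch below counts the cells for one class treated as "A" against all others; the label lists are made-up toy data and the function names are illustrative.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FN, FP, TN when class `positive` plays the role of class A."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1
        elif actual == positive:
            fn += 1
        elif predicted == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def precision_recall_f1(tp, fn, fp):
    """Standard definitions; each value is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy labels for a three-class problem.
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "A", "B", "C"]

counts = confusion_counts(y_true, y_pred, positive="A")
p, r, f = precision_recall_f1(counts[0], counts[1], counts[2])
print(counts, round(p, 3), round(r, 3), round(f, 3))
```

Repeating this for each category and averaging the per-class values gives mean precision and recall figures of the kind reported for a multi-class task.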
Classification method                       Precision (mean)    Recall (mean)    F1
VSM+KNN                                     67.3%               65.0%            66.1%
LDA+KNN                                     60.1%               55.4%            57.7%
TextCNN + pretrained word vectors           88.3%               87.5%            87.9%
TextCNN + extended vectors (this paper)     91.3%               91.6%            91.4%
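
As a consistency check on the results above, the sketch below recomputes each F1 value as the harmonic mean of the reported mean precision and mean recall, and derives the gain of the extended vectors over TextCNN with pretrained word vectors. The numbers are copied from the table; the English method labels are translations, and treating F1 as the harmonic mean of the two means is an assumption that the reported values happen to match.

```python
# (precision, recall, reported F1) in percent, copied from the results table.
results = {
    "VSM+KNN": (67.3, 65.0, 66.1),
    "LDA+KNN": (60.1, 55.4, 57.7),
    "TextCNN + pretrained word vectors": (88.3, 87.5, 87.9),
    "TextCNN + extended vectors": (91.3, 91.6, 91.4),
}


def harmonic_f1(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {
    name: round(harmonic_f1(p, r), 1) for name, (p, r, _) in results.items()
}
for name, (_, _, reported) in results.items():
    print(f"{name}: recomputed {recomputed[name]}, reported {reported}")

base = results["TextCNN + pretrained word vectors"]
ours = results["TextCNN + extended vectors"]
precision_gain = round(ours[0] - base[0], 1)
recall_gain = round(ours[1] - base[1], 1)
print("precision gain:", precision_gain, "recall gain:", recall_gain)
```

The recomputed F1 values agree with the table to one decimal place, and the gains correspond to the 3.0 and 4.1 point improvements stated in the abstract.
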