Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (1): 111-120    DOI: 10.11925/infotech.2096-3467.2019.0790
Classification of Short Texts Based on nLD-SVM-RF Model
Bengong Yu1,2, Yumeng Cao1, Yangnan Chen1, Ying Yang1,2
1School of Management, Hefei University of Technology, Hefei 230009, China
2Key Laboratory of Process Optimization & Intelligent Decision-making, Ministry of Education, Hefei University of Technology, Hefei 230009, China
Abstract  

[Objective] This paper addresses the data sparseness of short texts in order to improve the performance of short text classification. [Methods] We propose a multi-channel text model that integrates semantic, word-order and topic features as the input to a short text classifier. We then build a classification method, nLD-SVM-RF, on the SVM and random forest algorithms, and evaluate it on short complaint texts. [Results] We compared the new model with single SVM and RF classifiers that use Doc2Vec features. With n = 5, the accuracy of the nLD-SVM-RF method was 9.70% and 6.25% higher, respectively. [Limitations] The experimental data set needs to be expanded. [Conclusions] The nLD-SVM-RF model gives the business community a practical way to analyse short texts and improve decision-making.
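
The two ideas named in the Methods, fusing feature channels and combining an SVM with a random forest, can be illustrated with a minimal sketch. The abstract does not state how the channels are fused, how the two classifiers' outputs are combined, or what n denotes, so everything below is an assumption for illustration only: the channels are simply concatenated, the classifiers are combined by equal-weight soft voting, and the function names (`fuse_channels`, `soft_vote`), the class labels and all numbers are hypothetical rather than taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple


def fuse_channels(topic_vec: Sequence[float], doc_vec: Sequence[float]) -> List[float]:
    """Concatenate a topic-channel vector and a document-vector channel.

    Assumption: plain concatenation; the paper may fuse the channels differently.
    """
    return list(topic_vec) + list(doc_vec)


def soft_vote(
    probs_a: Dict[str, float],
    probs_b: Dict[str, float],
    weight_a: float = 0.5,
) -> Tuple[str, Dict[str, float]]:
    """Weighted average of two classifiers' class probabilities.

    Assumption: soft voting; the paper may combine SVM and RF differently.
    Returns the winning label and the averaged probabilities.
    """
    labels = sorted(set(probs_a) | set(probs_b))
    fused = {
        label: weight_a * probs_a.get(label, 0.0)
        + (1.0 - weight_a) * probs_b.get(label, 0.0)
        for label in labels
    }
    best = max(labels, key=lambda label: fused[label])
    return best, fused


# Hypothetical values for a single complaint text (not from the paper).
topic_vec = [0.7, 0.2, 0.1]   # e.g. a topic distribution over 3 topics
doc_vec = [0.12, -0.40]       # e.g. a 2-dimensional document embedding
features = fuse_channels(topic_vec, doc_vec)

svm_probs = {"billing": 0.55, "network": 0.45}  # hypothetical SVM output
rf_probs = {"billing": 0.30, "network": 0.70}   # hypothetical RF output
label, fused_probs = soft_vote(svm_probs, rf_probs)

print(len(features), label)
```

In this toy case the two classifiers disagree, and the averaged probabilities decide the final label. In practice the topic and document vectors would come from trained models, and the class probabilities from fitted classifiers.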

Key words: Short Text Classification; Multi-Channel Modelling; SVM; Random Forest; Ensemble Learning; nLD-SVM-RF
Received: 03 July 2019      Published: 14 March 2020
ZTFLH:  G254.1  
Corresponding Authors: Yumeng Cao     E-mail: caoyumeng1029@163.com

Cite this article:

Bengong Yu, Yumeng Cao, Yangnan Chen, Ying Yang. Classification of Short Texts Based on nLD-SVM-RF Model. Data Analysis and Knowledge Discovery, 2020, 4(1): 111-120.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.0790     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I1/111

Framework of the nLD-SVM-RF Model
Multi-channel Feature Fusion of Short Text
Doc2Vec Model
Text Modeling Comparison of LDA, Doc2Vec, LD Multi-channel Short Text Features
Comparison of LD-KNN, LD-DecisionTree, LD-SVM, LD-RF, LD-SVM-RF Classification Effects
Comparison of Classification Effects of n = 1, 3, 5, 7, 9