New Technology of Library and Information Service, 2016, Vol. 32, Issue 6: 12-19    DOI: 10.11925/infotech.1003-3513.2016.06.02
Research Paper
Classifying Chinese News Texts with Denoising Auto Encoder
Liu Hongguang, Ma Shuanggang, Liu Guifeng
Institute of Scientific & Technical Information, Jiangsu University, Zhenjiang 212013, China
Full text: PDF (576 KB) | HTML
Export: BibTeX | EndNote (RIS)
Abstract

[Objective] This paper proposes a new method to improve the classification accuracy of Chinese news texts with the help of deep learning theory. [Methods] We first used a denoising auto encoder to construct a deep network that learns a compressed, distributed representation of the Chinese news texts. Second, we used the SVM algorithm at the network's final layer to classify these texts into concrete categories. [Results] As the number of samples increased, the precision, recall and F value of the proposed method also increased. The results were better than those obtained with the KNN, BP and SVM algorithms, and the average precision was above 95%. [Limitations] The data set was relatively small, so the proposed method did not fully exploit the capacity of deep learning to process large volumes of data in parallel. [Conclusions] The proposed method improves the accuracy of feature extraction as well as the classification performance for Chinese news texts.
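
The workflow described above (greedy layer-wise pre-training of denoising auto-encoder layers on vectorized news texts, followed by an SVM classifier on the learned codes) can be illustrated with the minimal sketch below. It is not the authors' implementation: the document-term matrices X_train/X_test, the labels y_train, the layer sizes, the corruption rate and the learning rate are all assumed placeholders, and scikit-learn's SVC merely stands in for the SVM step.

```python
# Minimal sketch (not the authors' code) of the abstract's pipeline:
# greedy layer-wise pre-training with denoising auto-encoders, then an SVM
# on the learned representation. All hyper-parameters are illustrative.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_dae_layer(X, n_hidden, corruption=0.3, lr=0.1, epochs=20):
    """Train one denoising auto-encoder layer (tied weights); return its encoder."""
    n_samples, n_visible = X.shape
    W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
    b_hid = np.zeros(n_hidden)
    b_vis = np.zeros(n_visible)
    for _ in range(epochs):
        # corrupt the input by randomly zeroing a fraction of its entries
        X_tilde = X * (rng.random(X.shape) > corruption)
        H = sigmoid(X_tilde @ W + b_hid)   # encode the corrupted input
        Z = sigmoid(H @ W.T + b_vis)       # decode back to the visible layer
        dZ = (Z - X) * Z * (1 - Z)         # squared-error gradient at the decoder pre-activation
        dH = (dZ @ W) * H * (1 - H)        # back-propagated to the encoder pre-activation
        W -= lr * (X_tilde.T @ dH + dZ.T @ H) / n_samples
        b_hid -= lr * dH.mean(axis=0)
        b_vis -= lr * dZ.mean(axis=0)
    return W, b_hid

def encode(X, layers):
    """Push the clean input through the stacked encoders."""
    for W, b in layers:
        X = sigmoid(X @ W + b)
    return X

# Greedy layer-wise pre-training, each layer trained on the previous layer's codes:
# W1, b1 = train_dae_layer(X_train, n_hidden=500)
# W2, b2 = train_dae_layer(sigmoid(X_train @ W1 + b1), n_hidden=100)
# layers = [(W1, b1), (W2, b2)]
# SVM classification on the learned codes (the final layer described in the abstract):
# clf = SVC(kernel='rbf').fit(encode(X_train, layers), y_train)
# predictions = clf.predict(encode(X_test, layers))
```

In practice the hidden-layer sizes, corruption level and SVM parameters would be tuned on held-out data, and sparse matrices would be preferred for large vocabularies; the sketch only shows the shape of the pipeline.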

Key words: Denoising Auto Encoder (DAE); Support Vector Machine (SVM); Feature selection; Document classification
Received: 2016-01-13
Funding: This work is one of the outcomes of the Youth Project of Humanities and Social Sciences Research, Ministry of Education of China, "Research on Multi-label Classification of Patent Texts Based on the Hypergraph Model" (Grant No. 14YJC870014).
Cite this article:
Liu Hongguang, Ma Shuanggang, Liu Guifeng. Classifying Chinese News Texts with Denoising Auto Encoder [J]. New Technology of Library and Information Service, 2016, 32(6): 12-19. DOI: 10.11925/infotech.1003-3513.2016.06.02.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.06.02
[1] Pei Yingbo, Liu Xiaoxia. Study on Improved CHI for Feature Selection in Chinese Text Categorization [J]. Computer Engineering and Applications, 2011, 47(4): 128-130. (in Chinese)
[2] Xin Zhu, Zhou Yajian. Study and Improvement of Mutual Information for Feature Selection in Text Categorization [J]. Journal of Computer Applications, 2013, 33(S2): 116-118, 152. (in Chinese)
[3] Guo Song, Ma Fei. Improving the Algorithm of Information Gain Feature Selection in Text Classification [J]. Computer Applications and Software, 2013, 30(8): 139-142. (in Chinese)
[4] Peters C, Koster C H. Uncertainty-based Noise Reduction and Term Selection in Text Categorization [C]. In: Proceedings of the 24th BCS-IRSG European Colloquium on IR Research, Glasgow, UK. Springer, 2002: 248-267.
[5] Lewis D D. Representation and Learning in Information Retrieval [D]. University of Massachusetts, 1992.
[6] Li Xuexiang. Research of Text Categorization Based on Improved Maximum Entropy Algorithm [J]. Computer Science, 2012, 39(6): 210-212. (in Chinese)
[7] Hinton G E, Salakhutdinov R R. Reducing the Dimensionality of Data with Neural Networks [J]. Science, 2006, 313(5786): 504-507.
[8] Bengio Y, Lamblin P, Popovici D, et al. Greedy Layer-wise Training of Deep Networks [C]. In: Proceedings of the 20th Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada. 2007, 19: 153.
[9] Vincent P, Larochelle H, Bengio Y, et al. Extracting and Composing Robust Features with Denoising Autoencoders [C]. In: Proceedings of the 25th International Conference on Machine Learning. ACM, 2008: 1096-1103.
[10] Masci J, Meier U, Cireşan D, et al. Stacked Convolutional Auto-encoders for Hierarchical Feature Extraction [C]. In: Proceedings of the 21st International Conference on Artificial Neural Networks. Springer Berlin Heidelberg, 2011: 52-59.
[11] Wang Caixia, Wei Xueyun, Wang Biao. Dynamic Texture Classification Method Based on Stacked Denoising Autoencoding Model [J]. Modern Electronics Technique, 2015, 38(6): 20-24. (in Chinese)
[12] Wu Z, Takaki S, Yamagishi J. Deep Denoising Auto-encoder for Statistical Speech Synthesis [OL]. arXiv:1506.05268, 2015.
[13] Li J, Struzik Z, Zhang L, et al. Feature Learning from Incomplete EEG with Denoising Autoencoder [J]. Neurocomputing, 2015, 165: 23-31.
[14] Hu Shuai, Yuan Zhiyong, Xiao Ling, et al. Stacked Denoising Autoencoders Applied to Clinical Diagnose and Classification [J]. Application Research of Computers, 2015, 32(5): 1417-1420. (in Chinese)
[15] Liu Kan, Yuan Yunying. Short Texts Feature Extraction and Clustering Based on Auto-Encoder [J]. Acta Scientiarum Naturalium Universitatis Pekinensis, 2015, 51(2): 282-288. (in Chinese)
[16] Qin Shengjun, Lu Zhiping. Research of Unbalance Sentiment Classification Based on Denoising Autoencoders [J]. Science Technology and Engineering, 2014, 14(12): 232-235. (in Chinese)
[17] Bengio Y, Delalleau O. On the Expressive Power of Deep Architectures [C]. In: Proceedings of the 22nd International Conference on Algorithmic Learning Theory. Springer Berlin Heidelberg, 2011: 18-36.
[18] Vincent P, Larochelle H, Bengio Y, et al. Extracting and Composing Robust Features with Denoising Autoencoders [C]. In: Proceedings of the 25th International Conference on Machine Learning. ACM, 2008: 1096-1103.
[19] Neural Networks and Deep Learning [EB/OL]. [2015-12-23].
[20] Vapnik V N. The Nature of Statistical Learning Theory [M]. New York: Springer, 1995.
[21] NLPIR Chinese Word Segmentation System [EB/OL]. [2015-09-22].
[22] Text Categorization Corpus (Fudan) Test Corpus [EB/OL]. [2015-12-24].