New Technology of Library and Information Service, 2016, Vol. 32 Issue (6): 12-19     https://doi.org/10.11925/infotech.1003-3513.2016.06.02
Research Article
基于降噪自动编码器的中文新闻文本分类方法研究*
刘红光, 马双刚, 刘桂锋
江苏大学科技信息研究所 镇江 212013
Classifying Chinese News Texts with Denoising Auto Encoder
Liu Hongguang, Ma Shuanggang, Liu Guifeng
Institute of Scientific & Technical Information, Jiangsu University, Zhenjiang 212013, China
摘要 

【目的】借助深度学习理论, 解决传统特征选择方法容易导致特征项不明确、分类精度下降的问题。【方法】对中文新闻文本进行分类时, 使用降噪自动编码器构建一个深层网络来学习对文本的压缩及分布式的表示, 并在网络最后一层采用SVM算法将其分类到具体的类别中去。【结果】随着样本数目的增大, 分类准确率、召回率和F值都在上升, 且比KNN算法、BP算法和SVM算法取得了更优的分类效果, 平均分类准确率达到95%以上。【局限】数据量依然较小, 且并没有完全发挥深度学习并行处理大容量数据的优势。【结论】该方法能提高特征项提取的准确性, 并能提高分类效果。

关键词: 降噪自动编码器; 支持向量机; 特征提取; 文本分类
Abstract

[Objective] This paper proposes a new method to improve the classification accuracy of Chinese news texts with the help of deep learning theory. [Methods] We first used a denoising auto encoder to construct a deep network that learns a compressed and distributed representation of the Chinese news texts, and then used the SVM algorithm to classify these texts. [Results] As the number of samples increased, the precision, recall and F value of the proposed method also increased. The results are better than those of the KNN, BP and SVM algorithms, with an average precision higher than 95%. [Limitations] The data size was relatively small, so the proposed method did not fully utilize the parallel data processing capacity of deep learning. [Conclusions] The proposed method improves the accuracy of feature extraction and the performance of classifying Chinese news texts.
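The abstract describes a two-stage pipeline: a denoising auto encoder (DAE) learns a compressed, distributed representation of the news-text vectors, and an SVM classifies the learned features. The minimal Python sketch below illustrates that idea only; it is not the authors' code, and the toy bag-of-words data, single hidden layer, learning rate and masking-noise level are illustrative assumptions, with scikit-learn's LinearSVC standing in for the SVM stage.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)

# Toy stand-in for term-weight vectors of news texts, with synthetic labels.
n_docs, n_terms, n_hidden = 400, 300, 64
X = rng.random((n_docs, n_terms))
y = (X[:, :10].sum(axis=1) > 5.0).astype(int)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Denoising auto encoder with tied weights: corrupt the input with masking
# noise, encode, decode, and reduce the squared reconstruction error
# against the clean input.
W = rng.normal(0.0, 0.1, (n_terms, n_hidden))
b_h, b_o = np.zeros(n_hidden), np.zeros(n_terms)
lr, noise_level = 0.1, 0.3

for _ in range(20):                                        # passes over the corpus
    for x in X:
        x_noisy = x * (rng.random(n_terms) > noise_level)  # masking noise
        h = sigmoid(x_noisy @ W + b_h)                     # encode
        x_hat = sigmoid(h @ W.T + b_o)                     # decode
        d_out = (x_hat - x) * x_hat * (1.0 - x_hat)        # output-layer delta
        d_hid = (d_out @ W) * h * (1.0 - h)                # hidden-layer delta
        W -= lr * (np.outer(x_noisy, d_hid) + np.outer(d_out, h))
        b_o -= lr * d_out
        b_h -= lr * d_hid

# Encode every document with the trained DAE, then train the top-layer SVM
# on the hidden features and report precision, recall and F value.
H = sigmoid(X @ W + b_h)
H_train, H_test, y_train, y_test = train_test_split(H, y, test_size=0.3, random_state=0)
svm = LinearSVC().fit(H_train, y_train)
print(classification_report(y_test, svm.predict(H_test)))

In the paper's setting the input vectors would come from segmented and weighted Chinese news texts, and several DAE layers can be stacked before the SVM; the sketch keeps a single layer for brevity.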

Key words: DAE; SVM; Feature selection; Document classification
Received: 2016-01-13      Published: 2016-07-18
Funding: *This paper is one of the outcomes of the Humanities and Social Sciences Youth Fund project of the Ministry of Education, "Research on Multi-label Classification of Patent Texts Based on a Hypergraph Model" (Project No. 14YJC870014).
Cite this article:
刘红光,马双刚,刘桂锋. 基于降噪自动编码器的中文新闻文本分类方法研究*[J]. 现代图书情报技术, 2016, 32(6): 12-19.
Liu Hongguang,Ma Shuanggang,Liu Guifeng. Classifying Chinese News Texts with Denoising Auto Encoder. New Technology of Library and Information Service, 2016, 32(6): 12-19.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.06.02      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2016/V32/I6/12