New Technology of Library and Information Service  2016, Vol. 32 Issue (6): 12-19    DOI: 10.11925/infotech.1003-3513.2016.06.02
Classifying Chinese News Texts with Denoising Auto Encoder
Liu Hongguang, Ma Shuanggang, Liu Guifeng
Institute of Scientific & Technical Information, Jiangsu University, Zhenjiang 212013, China

[Objective] This paper proposes a new method to improve the classification accuracy of Chinese news texts with the help of deep learning theory. [Methods] We first used a denoising autoencoder to construct a deep network that learns a compressed, distributed representation of the Chinese news texts, and then used the SVM algorithm to classify the texts. [Results] As the number of samples grew, the precision, recall and F-measure of the proposed method also increased, outperforming applications using the KNN, BP and SVM algorithms; the average precision was higher than 95%. [Limitations] The data set was relatively small, so the proposed method did not fully exploit the parallel data processing capacity of deep learning technology. [Conclusions] The proposed method improves the performance of applications classifying Chinese news texts.
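The core idea in [Methods] can be sketched in miniature: a denoising autoencoder corrupts its input (e.g. by randomly zeroing components), then learns to reconstruct the clean input, so the hidden layer is forced to capture a robust, low-dimensional representation that can be handed to an SVM. The sketch below shows only a single-layer DAE principle, not the paper's stacked deep network; all data, dimensions and hyperparameters are illustrative assumptions.

```python
import math
import random

random.seed(42)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyDAE:
    """One-hidden-layer denoising autoencoder with tied weights."""
    def __init__(self, n_in, n_hidden, lr=0.1):
        self.lr = lr
        self.W = [[random.uniform(-0.5, 0.5) for _ in range(n_in)]
                  for _ in range(n_hidden)]
        self.b = [0.0] * n_hidden   # encoder bias
        self.c = [0.0] * n_in       # decoder bias

    def encode(self, x):
        # h = sigmoid(W x + b): the learned low-dimensional representation
        return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
                for row, bj in zip(self.W, self.b)]

    def decode(self, h):
        # z = sigmoid(W^T h + c): reconstruction of the input
        return [sigmoid(sum(self.W[j][i] * h[j] for j in range(len(h))) + ci)
                for i, ci in enumerate(self.c)]

    def train_step(self, x, corruption=0.3):
        # Masking noise: randomly zero a fraction of the input, then
        # train the network to reconstruct the CLEAN input from it.
        x_tilde = [0.0 if random.random() < corruption else xi for xi in x]
        h = self.encode(x_tilde)
        z = self.decode(h)
        loss = sum((zi - xi) ** 2 for zi, xi in zip(z, x))
        # Backprop for squared loss with sigmoid units and tied weights.
        dz = [2 * (zi - xi) * zi * (1 - zi) for zi, xi in zip(z, x)]
        dh = [sum(dz[i] * self.W[j][i] for i in range(len(x))) * h[j] * (1 - h[j])
              for j in range(len(h))]
        for j in range(len(h)):
            for i in range(len(x)):
                self.W[j][i] -= self.lr * (dz[i] * h[j] + dh[j] * x_tilde[i])
            self.b[j] -= self.lr * dh[j]
        for i in range(len(x)):
            self.c[i] -= self.lr * dz[i]
        return loss

# Toy "term vectors" standing in for two classes of documents.
data = [[1, 1, 0, 0], [1, 0.9, 0.1, 0], [0, 0, 1, 1], [0.1, 0, 0.9, 1]]
dae = TinyDAE(n_in=4, n_hidden=2)
first = sum(dae.train_step(x) for x in data)
for _ in range(1000):
    last = sum(dae.train_step(x) for x in data)
# Reconstruction error should drop as the representation improves.
print(first, last)
```

In the paper's pipeline, the hidden codes produced by `encode()` (for a much higher-dimensional term vector and a stack of such layers) would serve as the feature vectors passed to the SVM classifier.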

Key words: DAE; SVM; Feature selection; Document classification
Received: 13 January 2016      Published: 18 July 2016

Cite this article:

Liu Hongguang,Ma Shuanggang,Liu Guifeng. Classifying Chinese News Texts with Denoising Auto Encoder. New Technology of Library and Information Service, 2016, 32(6): 12-19.


Related articles:
[1] Liang Jiaming, Zhao Jie, Zheng Peng, Huang Liushen, Ye Minqi, Dong Zhenning. Framework for Computing Trust in Online Short-Rent Platform Using Feature Selection of Images and Texts[J]. Data Analysis and Knowledge Discovery, 2021, 5(2): 129-140.
[2] Shen Wang, Li Shiyu, Liu Jiayu, Li He. Optimizing Quality Evaluation for Answers of Q&A Community[J]. Data Analysis and Knowledge Discovery, 2021, 5(2): 83-93.
[3] Gong Lijuan, Wang Hao, Zhang Zixuan, Zhu Liping. Reducing Dimensions of Custom Declaration Texts with Word2Vec[J]. Data Analysis and Knowledge Discovery, 2020, 4(2/3): 89-100.
[4] Bengong Yu, Yumeng Cao, Yangnan Chen, Ying Yang. Classification of Short Texts Based on nLD-SVM-RF Model[J]. Data Analysis and Knowledge Discovery, 2020, 4(1): 111-120.
[5] Gang Li, Huayang Zhou, Jin Mao, Sijing Chen. Classifying Social Media Users with Machine Learning[J]. Data Analysis and Knowledge Discovery, 2019, 3(8): 1-9.
[6] Cheng Zhou, Hongqin Wei. Evaluating and Classifying Patent Values Based on Self-Organizing Maps and Support Vector Machine[J]. Data Analysis and Knowledge Discovery, 2019, 3(5): 117-124.
[7] Jiaming Liang, Jie Zhao, Zhou Jianlong, Zhenning Dong. Detecting Collusive Fraudulent Online Transaction with Implicit User Behaviors[J]. Data Analysis and Knowledge Discovery, 2019, 3(5): 125-138.
[8] Bengong Yu, Yangnan Chen, Ying Yang. Classifying Short Text Complaints with nBD-SVM Model[J]. Data Analysis and Knowledge Discovery, 2019, 3(5): 77-85.
[9] Tingxin Wen, Yangzi Li, Jingshuang Sun. News Hotspots Discovery Method Based on Multi Factor Feature Selection and AFOA/K-means[J]. Data Analysis and Knowledge Discovery, 2019, 3(4): 97-106.
[10] Zhanglu Tan, Zhaogang Wang, Han Hu. Study on a Method of Feature Classification Selection Based on χ2 Statistics[J]. Data Analysis and Knowledge Discovery, 2019, 3(2): 72-78.
[11] Zixuan Zhang, Hao Wang, Liping Zhu, Sanhong Deng. Identifying Risks of HS Codes by China Customs[J]. Data Analysis and Knowledge Discovery, 2019, 3(1): 72-84.
[12] Li Lin, Li Hui. Computing Text Similarity Based on Concept Vector Space[J]. Data Analysis and Knowledge Discovery, 2018, 2(5): 48-58.
[13] Wen Tingxin, Li Yangzi, Sun Jingshuang. Extracting Text Features with Improved Fruit Fly Optimization Algorithm[J]. Data Analysis and Knowledge Discovery, 2018, 2(5): 59-69.
[14] Hou Jun, Liu Kui, Li Qianmu. Classification Recommendation Based on ESSVM[J]. Data Analysis and Knowledge Discovery, 2018, 2(3): 9-21.
[15] Zhao Yang, Li Qiqi, Chen Yuhan, Cao Wenhang. Examining Consumer Reviews of Overseas Shopping APP with Sentiment Analysis[J]. Data Analysis and Knowledge Discovery, 2018, 2(11): 19-27.
Copyright © 2016 Data Analysis and Knowledge Discovery