[Objective] This paper aims to improve the effectiveness of χ² statistics in text classification by improving the χ²-based feature selection method. Traditional χ² statistics cannot guarantee that information is balanced across categories, which degrades classification performance. [Methods] By analyzing the feature selection process of traditional χ² statistics and its limitations, a feature classification selection method based on χ² statistics is proposed, in which feature words are selected separately for each class according to the degree of correlation between the feature words and that class. [Results] The improved method was compared with the traditional method on text classification, using SVM as the classification model. The results show that the feature classification selection method based on χ² statistics significantly improved the accuracy, the average classification accuracy, the lowest classification accuracy, the stability, and the system running time. [Limitations] When the number of selected feature words was small, the difference between the results before and after the improvement was not obvious. [Conclusions] The feature classification selection method based on χ² statistics can effectively improve the stability and generalization performance of the classification model, reduce the fluctuation of classification accuracy, and improve the efficiency of the classification process.
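The idea of selecting feature words separately for each class can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `chi2_score` and `select_features_per_class`, the parameter `k`, and the `positive_only` filter (keeping only words positively associated with a class) are assumptions made for illustration. The score used is the standard χ² statistic over a 2x2 term/class contingency table.

```python
from collections import defaultdict


def chi2_score(a, b, c, d):
    """Chi-square statistic for one (term, class) 2x2 contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that do not contain the term
    d: documents outside the class that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_features_per_class(docs, labels, k, positive_only=True):
    """Return {class: top-k terms ranked by chi-square for that class}.

    docs is a list of token lists. positive_only is an illustrative
    assumption: it drops terms that are negatively associated with the class.
    """
    n_docs = len(docs)
    class_counts = defaultdict(int)                       # documents per class
    term_df = defaultdict(int)                            # document frequency per term
    term_class_df = defaultdict(lambda: defaultdict(int)) # per-class document frequency

    for tokens, label in zip(docs, labels):
        class_counts[label] += 1
        for term in set(tokens):
            term_df[term] += 1
            term_class_df[term][label] += 1

    selected = {}
    for label, n_c in class_counts.items():
        scored = []
        for term, df in term_df.items():
            a = term_class_df[term].get(label, 0)
            b = df - a
            c = n_c - a
            d = n_docs - n_c - b
            if positive_only and a * d <= b * c:
                continue
            scored.append((chi2_score(a, b, c, d), term))
        # Highest score first; ties broken alphabetically for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        selected[label] = [term for _, term in scored[:k]]
    return selected


if __name__ == "__main__":
    docs = [
        ["ball", "goal", "team"],
        ["ball", "team", "win"],
        ["stock", "market", "price"],
        ["stock", "price", "bank"],
        ["film", "actor", "scene"],
        ["film", "scene", "award"],
    ]
    labels = ["sports", "sports", "finance", "finance", "film", "film"]
    for label, terms in select_features_per_class(docs, labels, k=2).items():
        print(label, terms)
```

Because every class contributes its own top-k list, no class is left without feature words, which is the balance property the abstract attributes to the per-class approach. A global ranking by maximum χ² score, by contrast, may fill the feature set mostly from a few classes.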
Zhanglu Tan, Zhaogang Wang, Han Hu. Study on a Method of Feature Classification Selection Based on χ² Statistics[J]. Data Analysis and Knowledge Discovery, 2019, 3(2): 72-78.