Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (2): 72-78    DOI: 10.11925/infotech.2096-3467.2018.0509
Study on a Method of Feature Classification Selection Based on χ2 Statistics
Zhanglu Tan, Zhaogang Wang, Han Hu
School of Management, China University of Mining and Technology, Beijing 100083, China
Abstract  

[Objective] This paper improves the χ2 statistic in order to make it more effective for feature selection in text classification. Traditional χ2 feature selection cannot guarantee a balance of information between categories, which degrades classification performance. [Methods] After analyzing the feature selection process of the traditional χ2 statistic and its limitations, a feature classification selection method based on the χ2 statistic is proposed: feature words are selected separately for each class according to the degree of correlation between each feature word and that class. [Results] The traditional and improved methods were compared on text classification, using an SVM as the classification model. The feature classification selection method significantly improved the accuracy, the average classification accuracy, the lowest classification accuracy, the stability, and the system running time. [Limitations] When only a small number of feature words is selected, the difference between the traditional and improved methods is not obvious. [Conclusions] The feature classification selection method based on the χ2 statistic effectively improves the stability and generalization performance of the classification model, reduces the fluctuation of classification accuracy, and improves the efficiency of the classification process.
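
The per-class selection idea described in [Methods] can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the sketch assumes the standard document-level χ2 statistic over a 2x2 term/class contingency table, and the function names `chi2_scores` and `select_per_class`, the toy corpus, and the per-class quota `k` are all hypothetical.

```python
from collections import defaultdict


def chi2_scores(docs):
    """Return {label: {term: chi2}} for docs given as (set_of_terms, label).

    Assumes the standard chi-square statistic on a 2x2 contingency table:
    chi2 = N * (A*D - B*C)^2 / ((A+C) * (B+D) * (A+B) * (C+D)).
    """
    n = len(docs)
    class_count = defaultdict(int)  # label -> number of documents
    term_count = defaultdict(int)   # term -> number of documents containing it
    joint = defaultdict(int)        # (term, label) -> documents of label containing term
    for terms, label in docs:
        class_count[label] += 1
        for t in set(terms):
            term_count[t] += 1
            joint[(t, label)] += 1

    scores = {label: {} for label in class_count}
    for label, n_c in class_count.items():
        for t, n_t in term_count.items():
            a = joint.get((t, label), 0)  # in class, contains term
            b = n_t - a                   # other classes, contains term
            c = n_c - a                   # in class, lacks term
            d = n - a - b - c             # other classes, lacks term
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            scores[label][t] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_per_class(scores, k):
    """Pick the k highest-scoring terms separately for every class.

    Ties are broken alphabetically so the result is deterministic.
    """
    return {
        label: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for label, s in scores.items()
    }


# Hypothetical toy corpus: two documents per class.
corpus = [
    ({"coal", "mine"}, "mining"),
    ({"coal", "shaft"}, "mining"),
    ({"stock", "bank"}, "finance"),
    ({"stock", "loan"}, "finance"),
    ({"goal", "team"}, "sport"),
    ({"goal", "match"}, "sport"),
]
scores = chi2_scores(corpus)
selected = select_per_class(scores, k=1)
vocabulary = sorted(set().union(*selected.values()))
```

Because every class contributes its own quota of `k` terms, no class can be left without representative features, which is the balance property the abstract says a single global ranking cannot guarantee.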

Key words: χ2 Statistics; Feature Selection; Text Categorization; Stability
Received: 07 May 2018      Published: 27 March 2019

Cite this article:

Zhanglu Tan, Zhaogang Wang, Han Hu. Study on a Method of Feature Classification Selection Based on χ2 Statistics. Data Analysis and Knowledge Discovery, 2019, 3(2): 72-78.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.0509     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I2/72
