Data Analysis and Knowledge Discovery, 2019, Vol. 3, Issue 2: 72-78. DOI: 10.11925/infotech.2096-3467.2018.0509
Research Paper
Study on a Method of Feature Classification Selection Based on χ² Statistics*
Zhanglu Tan, Zhaogang Wang, Han Hu
School of Management, China University of Mining and Technology (Beijing), Beijing 100083, China
Abstract

[Objective] The traditional χ² statistic cannot guarantee that information is balanced across categories, which degrades classification performance. This paper improves the χ² statistic to make it more effective in practice. [Methods] By analyzing the feature selection process of the traditional χ² statistic and its limitations, the paper proposes a feature classification selection method based on the χ² statistic, which selects feature words category by category according to the degree of association between each feature word and each category. [Results] With SVM as the classification model, experiments compared the effect of the original and improved methods on text classification. The results show that the feature classification selection method significantly improves accuracy, average classification accuracy, lowest classification accuracy, stability, and system running time. [Limitations] When only a small number of feature words is selected, the difference between the original and improved methods is not obvious. [Conclusions] The feature classification selection method based on the χ² statistic effectively improves the stability and generalization performance of the classification model, reduces the fluctuation of classification accuracy, and significantly improves the efficiency of the classification process.

Keywords: χ² statistics; feature selection; text categorization; stability
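The abstract does not give the formula or the exact selection procedure, so the following is only a minimal sketch of the general idea of selecting feature words category by category. It assumes the commonly used 2×2 contingency form of the χ² statistic, binary term presence per document, and an equal quota of feature words per category; the function names (chi2_scores, select_global, select_per_class) and the toy corpus are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def chi2_scores(docs, labels):
    """Return {category: {term: chi2}} from document-level term presence."""
    n = len(docs)
    class_size = defaultdict(int)  # number of documents per category
    term_df = defaultdict(int)     # number of documents containing a term
    joint = defaultdict(int)       # documents of category c containing term t
    for tokens, c in zip(docs, labels):
        class_size[c] += 1
        for t in set(tokens):
            term_df[t] += 1
            joint[(t, c)] += 1

    scores = {c: {} for c in class_size}
    for c, n_c in class_size.items():
        for t, df in term_df.items():
            a = joint.get((t, c), 0)  # in category, contains term
            b = df - a                # other categories, contains term
            c_ = n_c - a              # in category, lacks term
            d = n - n_c - b           # other categories, lacks term
            denom = (a + c_) * (b + d) * (a + b) * (c_ + d)
            scores[c][t] = n * (a * d - b * c_) ** 2 / denom if denom else 0.0
    return scores


def select_global(scores, k):
    """Baseline: rank terms by their maximum chi2 over categories, keep top k."""
    best = defaultdict(float)
    for per_term in scores.values():
        for t, s in per_term.items():
            best[t] = max(best[t], s)
    return sorted(best, key=lambda t: (-best[t], t))[:k]


def select_per_class(scores, k_per_class):
    """Category-wise: keep the top k terms of every category, then merge."""
    selected = []
    for c in sorted(scores):
        ranked = sorted(scores[c], key=lambda t: (-scores[c][t], t))
        for t in ranked[:k_per_class]:
            if t not in selected:
                selected.append(t)
    return selected


# Toy corpus (hypothetical): one large, homogeneous category and two small ones.
docs = [["coal", "mine"]] * 4 + [["stock"], ["bank"], ["match"], ["goal"]]
labels = ["mining"] * 4 + ["finance"] * 2 + ["sports"] * 2

scores = chi2_scores(docs, labels)
print(select_global(scores, 2))     # one global ranking
print(select_per_class(scores, 1))  # one term per category
```

On this toy corpus the global ranking spends its whole budget on terms from the large category, while the category-wise selection keeps one term for each category. Note that χ² is symmetric, so a term that is strongly absent from a category can also score highly for it; the sketch does not filter such terms, and the paper may handle this differently.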
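Received: 2018-05-07
Funding: *This work is one of the research outcomes of the National Natural Science Foundation of China project "Research on a Visual Management Model and Graphic Element System for Coal Mine Safety Based on Data Mining" (Grant No. 61471362).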
Cite this article:
Zhanglu Tan, Zhaogang Wang, Han Hu. Study on a Method of Feature Classification Selection Based on χ² Statistics[J]. Data Analysis and Knowledge Discovery, 2019, 3(2): 72-78. DOI: 10.11925/infotech.2096-3467.2018.0509.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.0509