Study on a Method of Feature Classification Selection Based on χ² Statistics*

doi:10.11925/infotech.2096-3467.2018.0509

Data Analysis and Knowledge Discovery, 2019, Vol. 3, Issue (2): 72-78 https://doi.org/10.11925/infotech.2096-3467.2018.0509

Research Paper


Zhanglu Tan, Zhaogang Wang, Han Hu

School of Management, China University of Mining and Technology, Beijing 100083, China


Abstract

[Objective] Traditional χ² statistics cannot guarantee that information is balanced across categories, which degrades classification performance. This paper improves χ² statistics to raise its effectiveness in application. [Methods] After analyzing the feature selection process of traditional χ² statistics and its limitations, we propose a feature classification selection method based on χ² statistics that selects feature words class by class, according to the degree of association between each feature word and each class. [Results] Using SVM as the classification model, we experimentally compared the effect of the original and improved methods on text classification. The feature classification selection method based on χ² statistics significantly improved accuracy, average classification accuracy, lowest classification accuracy, stability, and system running time. [Limitations] When only a small number of feature words is selected, the difference between the original and improved methods is not obvious. [Conclusions] The feature classification selection method based on χ² statistics effectively improves the stability and generalization performance of the classification model, reduces the fluctuation of classification accuracy, and significantly improves the efficiency of the classification process.

Key words: χ² Statistics, Feature Selection, Text Categorization, Stability
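The contrast drawn in the abstract, between ranking all feature words in one global list and selecting them class by class, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact procedure, so the function names, the use of the standard χ² term-class score N(AD − BC)² / ((A+C)(B+D)(A+B)(C+D)), the "maximum over classes" global baseline, and the fixed quota `k_per_class` are all assumptions made for illustration.

```python
import numpy as np


def chi2_scores(X, y):
    """Return (classes, scores), where scores[i, t] is the chi-square
    association between term t and class classes[i].

    X: document-term matrix (n_docs, n_terms); any value > 0 counts as
       "term present". y: class label per document.
    Assumption: the standard 2x2 contingency form of the statistic.
    """
    X = (np.asarray(X) > 0).astype(float)
    y = np.asarray(y)
    classes = np.unique(y)
    n_docs = X.shape[0]
    scores = np.zeros((len(classes), X.shape[1]))
    for i, c in enumerate(classes):
        in_c = y == c
        A = X[in_c].sum(axis=0)      # docs in class c containing the term
        B = X[~in_c].sum(axis=0)     # docs outside c containing the term
        C = in_c.sum() - A           # docs in class c without the term
        D = (~in_c).sum() - B        # docs outside c without the term
        num = n_docs * (A * D - B * C) ** 2
        den = (A + C) * (B + D) * (A + B) * (C + D)
        # Terms with a degenerate contingency table get a score of 0.
        scores[i] = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return classes, scores


def select_global(scores, k):
    """Assumed baseline: rank terms by their maximum score over all
    classes and keep the top k, regardless of which class they favor."""
    return np.argsort(-scores.max(axis=0), kind="stable")[:k]


def select_per_class(scores, k_per_class):
    """Hypothetical per-class variant: take the top k_per_class terms
    for each class separately, then return the sorted union."""
    chosen = set()
    for row in scores:
        chosen.update(np.argsort(-row, kind="stable")[:k_per_class].tolist())
    return sorted(chosen)
```

When one class has several strongly associated terms, the global ranking can draw all k terms from that class, whereas the per-class variant reserves candidates for every class. This is one plausible reading of the balance problem the abstract describes.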

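Received: 2018-05-07 Published: 2019-03-27

Funding: *This work is one of the research outcomes of the National Natural Science Foundation of China project "Research on a Data-Mining-Based Visual Management Model and Graphic Element System for Coal Mine Safety" (Grant No. 61471362).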

Cite this article:

Zhanglu Tan, Zhaogang Wang, Han Hu. Study on a Method of Feature Classification Selection Based on χ² Statistics[J]. Data Analysis and Knowledge Discovery, 2019, 3(2): 72-78.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.0509 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2019/V3/I2/72


