A Feature Selection Based on Consideration of Multiple Factors (多因素影响的特征选择方法)

doi:10.11925/infotech.1003-3513.2013.05.04

New Technology of Library and Information Service (现代图书情报技术)

2013, Issue (5): 34-39 https://doi.org/10.11925/infotech.1003-3513.2013.05.04

Knowledge Organization and Knowledge Management


A Feature Selection Based on Consideration of Multiple Factors

Lu Yonghe (路永和), Li Yanfeng (李焰锋)

School of Information Management, Sun Yat-Sen University, Guangzhou 510006, China


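Abstract (translated from the Chinese): In feature selection, the weight that a feature selection evaluation function assigns to a term determines whether that term is selected as a feature. This weight, however, is influenced by several factors, chiefly the term's importance, its distinctiveness, and its representativeness. Starting from these factors, the paper constructs a new feature selection function, TW. Comparative experiments on the chi-square statistic (CHI), information gain (IG), and TW for individual terms verify that TW raises the weights of category-specific terms and lowers the weights of features that are common but unimportant for classification. With TW as a new feature selection algorithm, classification experiments on a Chinese text classification corpus using three classifiers, KNN, class centroid, and support vector machine (SVM), together with comparisons against other feature selection algorithms, verify the effectiveness of the proposed algorithm.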


Abstract: In feature selection, a term's weight determines whether the term is selected as a feature. The weight, however, is affected by many factors, chiefly the term's importance, distinctiveness, and representativeness. Taking these factors into account, a new function TW (Term Weight), based on the importance of a feature and its ability to distinguish categories, is proposed as an improved feature selection method. Comparative experiments on terms' CHI, IG, and TW values show that TW increases the weight of category-specific features and decreases the weight of features that are unimportant for classification. Finally, classification experiments with three classifiers on a Chinese text classification corpus verify the effectiveness of the new feature selection algorithm.

Keywords: Text categorization, Feature selection, Class discrimination, TF-IDF
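The abstract compares TW against two standard baselines, CHI and IG, but this page does not give the TW formula, so TW is not reproduced here. As a rough illustration of the two baselines only, the sketch below uses the common textbook definitions computed from document counts. The function names and the layout of the count arguments are assumptions made for this example, not the authors' implementation.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square score of a term for one category (textbook 2x2 form).

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: term or category never varies
    return n * (a * d - c * b) ** 2 / denom


def entropy(counts):
    """Shannon entropy (base 2) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total) for k in counts if k > 0)


def information_gain(with_term, without_term):
    """Information gain of a term over all categories.

    with_term[i]:    documents of category i that contain the term
    without_term[i]: documents of category i that lack the term
    """
    n_t, n_not = sum(with_term), sum(without_term)
    n = n_t + n_not
    if n == 0:
        return 0.0
    prior = entropy([w + wo for w, wo in zip(with_term, without_term)])
    conditional = (n_t / n) * entropy(with_term) + (n_not / n) * entropy(without_term)
    return prior - conditional


if __name__ == "__main__":
    # A term that appears in every document of one category and nowhere else
    print(chi_square(10, 0, 0, 10))
    print(information_gain([10, 0], [0, 10]))
```

Both scores are high for a term concentrated in a single category and zero for a term spread evenly across categories, which is the behaviour the abstract measures TW against.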

Received: 2013-04-16 Published: 2013-07-03

TP391

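Funding: This work is one of the research outcomes of the project "Multi-source Information Sensing Technology and Product Development for the Whole Agricultural Product Supply Chain" (No. 2012AA101701), funded by the National High Technology Research and Development Program of China (863 Program).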

Corresponding author: Lu Yonghe E-mail: zsuluyonghe@163.com

Cite this article:

Lu Yonghe, Li Yanfeng. A Feature Selection Based on Consideration of Multiple Factors[J]. New Technology of Library and Information Service, 2013, (5): 34-39.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2013.05.04 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2013/V/I5/34


