A Feature Selection Method for Text Classification Based on Statistical Frequency*

doi:10.11925/infotech.1003-3513.2008.11.09

New Technology of Library and Information Service (现代图书情报技术)

2008, Vol. 24, Issue 11: 44-48

Section: Knowledge Organization and Knowledge Management


Zhang Junli (张俊丽), Zhao Naixuan (赵乃瑄), Feng Jun (冯君)

(Library of Nanjing University of Technology, Nanjing 210009, China)


Abstract

This paper analyzes the shortcomings of the χ² statistic (Chi-square, CHI): it is unreliable for feature terms with low document frequency, and it does not show how a term correlates with a category. To address these shortcomings, the paper improves on CHI and proposes the Statistical Frequency (SF) algorithm. Comparative experiments show that the SF algorithm compensates for these shortcomings and achieves good classification performance in text classification.

Keywords: text classification, feature selection, KNN, χ² statistic (Chi-square)
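The second shortcoming named in the abstract can be illustrated with a small sketch. It assumes the textbook form of the χ² statistic over a 2×2 term/category contingency table, χ²(t, c) = N(AD − CB)² / ((A + C)(B + D)(A + B)(C + D)); that formula and the names `chi_square` and `association_sign` are assumptions for illustration and are not taken from this page. The SF formula itself is not given here, so it is not sketched.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Textbook chi-square score for one term and one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence either way.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def association_sign(a: int, b: int, c: int, d: int) -> int:
    """Return +1, 0, or -1 for the direction of the term/category association."""
    diff = a * d - c * b
    return (diff > 0) - (diff < 0)


# A term concentrated inside the category (hypothetical counts).
positive = (40, 10, 10, 40)
# The mirror image: a term concentrated outside the category.
negative = (10, 40, 40, 10)

# Squaring (AD - CB) discards the sign, so both tables get the same score.
print(chi_square(*positive), association_sign(*positive))
print(chi_square(*negative), association_sign(*negative))
```

Both tables receive the same χ² score even though the term signals membership in one case and non-membership in the other, which is one way to read the claim that CHI alone does not show how a term relates to a category.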

Received: 2008-08-13; Published: 2008-11-25

CLC number: TP391

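Funding:

*This paper is one of the research outcomes of the Jiangsu Provincial Department of Education Philosophy and Social Sciences Fund project for universities, "Performance Evaluation and Development Strategy of Purchased Resources in Jiangsu University Digital Libraries" (Project No. 08SJB8700004).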

Corresponding author: Zhang Junli, E-mail: elili62@126.com

About the authors: Zhang Junli, Zhao Naixuan, Feng Jun

Cite this article:

Zhang Junli, Zhao Naixuan, Feng Jun. A Feature Selection Method for Text Classification Based on Statistical Frequency[J]. New Technology of Library and Information Service, 2008, 24(11): 44-48.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2008.11.09 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2008/V24/I11/44


