New Technology of Library and Information Service  2008, Vol. 24 Issue (11): 44-48    DOI: 10.11925/infotech.1003-3513.2008.11.09
A Feature Selection Method for Text Classification Based on Statistical Frequency
Zhang Junli  Zhao Naixuan  Feng Jun 
(Library of Nanjing University of Technology, Nanjing 210009, China)
This paper analyzes the chi-square (CHI) feature selection algorithm, which is unreliable for terms with low document frequency and does not reflect the relevance between a term and a category. To address these shortcomings, a new Statistical Frequency (SF) algorithm is proposed. Comparative experiments show that the improved algorithm performs better.
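
For readers unfamiliar with CHI, the sketch below computes the textbook chi-square statistic for one term and one category from a 2x2 document-count table. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not define the SF algorithm, so SF is not shown, and the function name `chi_square` and all counts are hypothetical.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Textbook chi-square statistic for a term/category 2x2 table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        # Degenerate table: a whole row or column of the table is empty.
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator


# Hypothetical counts: 1000 documents, 100 of them in the category.
rare_term = chi_square(a=2, b=0, c=98, d=900)          # term occurs in 2 documents
common_term = chi_square(a=60, b=140, c=40, d=760)     # term occurs in 200 documents
negative_term = chi_square(a=0, b=450, c=100, d=450)   # term never occurs in the category

print(f"rare: {rare_term:.2f}  common: {common_term:.2f}  negative: {negative_term:.2f}")
```

With these made-up counts, the term seen in only two documents still scores about 18, which would look significant even though it rests on almost no evidence. The term that never occurs in the category scores about 91, because the squared difference discards the sign of the association. This is one plausible reading of the two shortcomings named in the abstract.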

Key words: Text categorization      Feature selection      KNN      Chi-square
Received: 13 August 2008      Published: 25 November 2008


Corresponding author: Zhang Junli


