This paper analyzes the shortcomings of the chi-square statistic (χ², CHI) as a feature selection measure for text classification: it is unreliable for terms with low document frequency, and it does not indicate the correlation between a term and a category. To address these shortcomings, an improved method, the Statistical Frequency (SF) algorithm, is proposed. Comparative experiments show that the SF algorithm compensates for these weaknesses and achieves good classification performance.
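The two weaknesses of CHI named in the abstract can be illustrated with a minimal Python sketch. It uses the standard textbook chi-square score over a 2x2 term/category document-count table; the function name `chi_square` and the counts are hypothetical and chosen only for illustration, and the sketch does not implement the SF algorithm, whose formula is not given here. Reading "does not indicate the correlation" as "loses the direction of the association" is an assumption.

```python
def chi_square(a, b, c, d):
    """Standard chi-square score of a term for a category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no information.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


# Hypothetical counts. The squared term (a*d - c*b)**2 discards the sign,
# so a term that favors the category and one that avoids it score the same.
positive = chi_square(40, 10, 10, 40)
negative = chi_square(10, 40, 40, 10)
print(positive, negative)

# A term seen in a single document, which happens to fall in a tiny
# category, outscores the frequent and clearly informative term above.
rare = chi_square(1, 0, 1, 998)
print(rare)
```

With these counts both the positively and the negatively associated term score 36.0, while the term that occurs in only one document scores about 499.5, which is the kind of low-document-frequency distortion the abstract refers to.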
Zhang Junli, Zhao Naixuan, Feng Jun. A Feature Selection Method for Text Classification Based on Statistical Frequency[J]. New Technology of Library and Information Service, 2008, 24(11): 44-48.