New Technology of Library and Information Service  2008, Vol. 24 Issue (11): 44-48    DOI: 10.11925/infotech.1003-3513.2008.11.09
A Feature Selection Method for Text Classification Based on Statistical Frequency
Zhang Junli, Zhao Naixuan, Feng Jun
(Library of Nanjing University of Technology, Nanjing 210009, China)
Abstract  

This paper analyzes the Chi-square (CHI) feature selection algorithm, which is unreliable for terms with low document frequency and does not reflect the relevance between a term and a category. To address these shortcomings, a new Statistical Frequency (SF) algorithm is proposed. The SF algorithm is validated through comparative experiments, and the results show that the improved algorithm performs better.
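
As background for the first shortcoming named above, the following is a minimal sketch of the standard chi-square statistic for one term and one category, as commonly defined in the text categorization literature. It is not the authors' code, and it does not implement the SF algorithm, whose formula is not given in this abstract. The function name `chi_square` and the counts `a`, `b`, `c`, `d` are illustrative assumptions.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Standard chi-square score for a (term, category) pair.

    Hypothetical argument names, counted over the training documents:
      a: documents in the category that contain the term
      b: documents outside the category that contain the term
      c: documents in the category that lack the term
      d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A margin is empty (e.g. the term never occurs), so the
        # statistic is undefined; return 0.0 by convention here.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


# A term seen in a single document can still score highly, which is
# the usual sense in which CHI is unreliable at low document frequency.
rare_term_score = chi_square(a=1, b=0, c=9, d=990)

# Squaring discards the direction of the association: a term typical of
# the category and a term typical of its complement get the same score.
typical_of_category = chi_square(a=40, b=10, c=10, d=40)
typical_of_complement = chi_square(a=10, b=40, c=40, d=10)
```

Reading the second shortcoming as the loss of the association's sign is one possible interpretation; the abstract itself does not spell out what is meant.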

Key words: Text categorization; Feature selection; KNN; Chi-square
Received: 13 August 2008      Published: 25 November 2008
ZTFLH: TP391
Corresponding Authors: Zhang Junli     E-mail: elili62@126.com
About author: Zhang Junli, Zhao Naixuan, Feng Jun

Cite this article:

Zhang Junli, Zhao Naixuan, Feng Jun. A Feature Selection Method for Text Classification Based on Statistical Frequency. New Technology of Library and Information Service, 2008, 24(11): 44-48.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2008.11.09     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2008/V24/I11/44

