New Technology of Library and Information Service  2014, Vol. 30 Issue (9): 66-73    DOI: 10.11925/infotech.1003-3513.2014.09.09
A Text Classification Algorithm Based on the Average Category Similarity
Tan Xueqing, Zhou Tong, Luo Lin
School of Information Management, Wuhan University, Wuhan 430072, China
[Objective] To improve classification performance and speed relative to the KNN algorithm. [Methods] This paper proposes a classification algorithm based on average category similarity: the category of a test text is determined by calculating, for each category, the mean similarity between the test text and all texts of that category in the training set. [Results] Experiments on the Fudan, balanced Sogou and unbalanced Sogou public corpora show that, compared with the KNN classification algorithm, the proposed method increases Macro_F1 on the three corpora by 3.5%, 3.2% and 3.3% respectively, and its classification time is 1/22, 1/6 and 1/5 that of the KNN algorithm respectively. [Limitations] Because of the low time efficiency of the KNN algorithm, the number of texts in the experimental data is small. [Conclusions] Compared with KNN, the proposed method is a practical algorithm for large-scale text classification.
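
The idea described under [Methods] can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, raw term-frequency vectors, cosine similarity, and the rule of choosing the category with the highest mean are assumptions made for illustration, and the names `vectorize`, `cosine` and `classify_by_average_similarity` are hypothetical. The feature selection and term weighting used in the paper are not reproduced here.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Turn a text into a sparse term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dict-like objects."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_by_average_similarity(test_text, training_set):
    """Return (predicted_label, per-category mean similarity).

    training_set is a list of (text, label) pairs. For each category, the
    similarities between the test text and every training text of that
    category are averaged; the category with the highest mean is returned
    (assumed decision rule).
    """
    test_vec = vectorize(test_text)
    sims_by_label = defaultdict(list)
    for text, label in training_set:
        sims_by_label[label].append(cosine(test_vec, vectorize(text)))
    averages = {label: sum(s) / len(s) for label, s in sims_by_label.items()}
    predicted = max(averages, key=averages.get)
    return predicted, averages


# Toy data, invented purely for illustration.
training_set = [
    ("stock market shares fall", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins football match", "sports"),
    ("player scores goal in match", "sports"),
]
label, averages = classify_by_average_similarity(
    "football player scores", training_set
)
```

In this sketch the decision needs neither a choice of k nor a sort of the nearest neighbours, only one running mean per category.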

Key words: Average category similarity; Vector Space Model (VSM); KNN; Text categorization; Feature selection
Received: 10 March 2014      Published: 20 October 2014
Tan Xueqing, Zhou Tong, Luo Lin. A Text Classification Algorithm Based on the Average Category Similarity. New Technology of Library and Information Service, 2014, 30(9): 66-73.

