New Technology of Library and Information Service  2014, Vol. 30 Issue (11): 66-72    DOI: 10.11925/infotech.1003-3513.2014.11.10
A Method for Eliminating Noise in Text Classification Based on Category Distribution Characteristics
Li Xiangdong1,2, Ba Zhichao1, Huang Li3
1 School of Information Management, Wuhan University, Wuhan 430072, China;
2 Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China;
3 Wuhan University Library, Wuhan 430072, China
[Objective] To reduce the impact of noisy samples on classification performance when constructing training sets, this paper proposes a noise-processing method based on the distribution characteristics of category data in the training samples. [Methods] The method defines Category Cluster Density to represent the degree of similarity among the documents in one category, and then normalizes the similarity distribution. Using approximate confidence interval estimation, it then identifies the document pairs that contain noisy samples. The relative entropy of the distribution and the Category Cluster Density are used to verify that the noisy documents were recognized correctly. [Results] Classification performance on the specialized and self-built corpora improves by 1.21% to 4.83%. [Limitations] The samples still need to be enriched, and the method needs to be tested in more fields and on more types of data. [Conclusions] The method is feasible and can effectively detect noisy documents. It can also check a large number of samples in a single batch.

Key words: Training samples; Distribution of similarity; Text classification; Cluster density; Noise processing
Received: 30 March 2014      Published: 18 December 2014
PACS:  TP391  

Cite this article:

Li Xiangdong, Ba Zhichao, Huang Li. A Method for Eliminating Noise in Text Classification Based on Category Distribution Characteristics. New Technology of Library and Information Service, 2014, 30(11): 66-72.

