A Method for Eliminating Noise in Text Classification Based on Category Distribution Characteristics
Li Xiangdong1,2, Ba Zhichao1, Huang Li3
1 School of Information Management, Wuhan University, Wuhan 430072, China;
2 Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China;
3 Wuhan University Library, Wuhan 430072, China
Abstract [Objective] To reduce the impact of noisy samples on classification performance when training sets are constructed, this paper proposes a noise-handling method based on the distribution characteristics of category data in the training samples. [Methods] The method defines Category Cluster Density to represent the degree of similarity among the documents in one category, and then transforms the similarity distribution into an approximately normal one. Using approximate confidence interval estimation, it then identifies the document pairs that contain noisy samples. The recognition of noisy documents is verified with the relative entropy of the distributions and Category Cluster Density. [Results] Classification performance on the specialized and self-built corpus improves by 1.21% to 4.83%. [Limitations] Future work needs to enrich the samples and test the method in more fields and on multiple types of data. [Conclusions] The method is feasible and can effectively detect noisy documents. It can also check a large number of samples in a single batch.
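
The abstract gives no formulas, so the following is only a minimal sketch of the general idea it describes, not the authors' implementation. It assumes cosine similarity between document vectors, assumes the pairwise similarities within a category are roughly normal, and flags document pairs whose similarity falls below the lower bound `mean - z * std`. The function names `cosine`, `suspicious_pairs`, and `relative_entropy`, the threshold `z`, and the toy vectors are all hypothetical. Relative entropy is the standard Kullback-Leibler divergence.

```python
import math
from itertools import combinations
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def suspicious_pairs(vectors, z=1.96):
    """Return index pairs whose similarity lies below mean - z * std.

    Assumption: similarities inside one category are approximately normal,
    so unusually dissimilar pairs are likely to contain a noisy document.
    """
    pairs = list(combinations(range(len(vectors)), 2))
    sims = [cosine(vectors[i], vectors[j]) for i, j in pairs]
    lower = mean(sims) - z * stdev(sims)
    return [p for p, s in zip(pairs, sims) if s < lower]


def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) of two discrete distributions.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy category: ten near-identical documents plus one off-topic document (index 10).
docs = [[1.0, 0.0]] * 10 + [[0.0, 1.0]]
flagged = suspicious_pairs(docs)
```

In this toy category every flagged pair contains document 10, which is how a pair-level test can point back to a single noisy sample. A divergence check with `relative_entropy` could then compare the category's similarity distribution before and after removing the candidate.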
Received: 30 March 2014
Published: 18 December 2014