New Technology of Library and Information Service  2014, Vol. 30 Issue (11): 66-72    DOI: 10.11925/infotech.1003-3513.2014.11.10
A Method for Eliminating Noise in Text Classification Based on Category Distribution Characteristics
Li Xiangdong1,2, Ba Zhichao1, Huang Li3
1 School of Information Management, Wuhan University, Wuhan 430072, China;
2 Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China;
3 Wuhan University Library, Wuhan 430072, China
Abstract  

[Objective] To reduce the impact of noisy samples on classification performance when training sets are constructed, this paper proposes a noise-processing method based on the distribution characteristics of category data in the training samples. [Methods] The method defines Category Cluster Density to represent the degree of similarity among the documents in one category, and then applies a normalization transformation to the similarity distribution. Using approximate confidence interval estimation, it identifies the document pairs that contain noisy samples. The relative entropy of the distributions and Category Cluster Density are then used to verify that the noisy documents were identified correctly. [Results] Classification performance on the specialized and self-built corpora improves by 1.21% to 4.83%. [Limitations] The sample set needs to be enriched, and the method remains to be tested across more fields and on multiple types of data. [Conclusions] The method is feasible and effectively detects noisy documents. It can also test a large number of samples in a single batch.
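
The abstract does not give the formulas, so the following Python sketch only illustrates the general shape of the pipeline under stated assumptions. It assumes cosine similarity over non-negative term-weight vectors, takes Category Cluster Density to be the mean pairwise similarity within a category, uses the lower bound mean - z * std in place of the paper's approximate confidence interval, and compares similarity histograms with relative entropy. All function names, the threshold z = 1.96, and the toy vectors are hypothetical and not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity of two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairwise_similarities(docs):
    """Map each index pair (i, j), i < j, to the similarity of docs i and j."""
    return {(i, j): cosine(docs[i], docs[j])
            for i, j in combinations(range(len(docs)), 2)}


def cluster_density(docs):
    """Assumed definition: mean pairwise similarity within one category."""
    return mean(list(pairwise_similarities(docs).values()))


def suspect_pairs(docs, z=1.96):
    """Pairs whose similarity falls below the lower bound mean - z * std."""
    sims = pairwise_similarities(docs)
    values = list(sims.values())
    lower = mean(values) - z * stdev(values)
    return [pair for pair, s in sims.items() if s < lower]


def noise_candidate(docs, z=1.96):
    """Index of the document seen most often in suspect pairs, or None."""
    counts = Counter(i for pair in suspect_pairs(docs, z) for i in pair)
    return counts.most_common(1)[0][0] if counts else None


def histogram(values, bins=10):
    """Normalised histogram of similarities assumed to lie in [0, 1]."""
    counts = [0] * bins
    for v in values:
        counts[max(0, min(int(v * bins), bins - 1))] += 1
    return [c / len(values) for c in counts]


def kl_divergence(p, q, eps=1e-12):
    """Relative entropy D(p || q); eps guards against empty bins in q."""
    return sum(pi * math.log(pi / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


# Toy category: twelve similar documents plus one planted noisy document.
clean = [[5 + i % 3, 4 + i % 2, 1, 0] for i in range(12)]
docs = clean + [[0, 0, 1, 6]]

candidate = noise_candidate(docs)
cleaned = [d for k, d in enumerate(docs) if k != candidate]

# Verification: removing true noise should raise the density and shift
# the similarity distribution away from the original one.
before, after = cluster_density(docs), cluster_density(cleaned)
shift = kl_divergence(
    histogram(list(pairwise_similarities(cleaned).values())),
    histogram(list(pairwise_similarities(docs).values())),
)
print(candidate, round(before, 3), round(after, 3), round(shift, 3))
```

One consequence of thresholding pair similarities this way is that a single noisy document is flagged only when its pairs are a small fraction of all pairs; in this toy setup that requires roughly nine or more clean documents at z = 1.96.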

Key words: Training samples; Distribution of similarity; Text classification; Cluster density; Noise processing
Received: 30 March 2014      Published: 18 December 2014
CLC Number: TP391

Cite this article:

Li Xiangdong, Ba Zhichao, Huang Li. A Method for Eliminating Noise in Text Classification Based on Category Distribution Characteristics. New Technology of Library and Information Service, 2014, 30(11): 66-72.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2014.11.10     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2014/V30/I11/66

