New Technology of Library and Information Service  2014, Vol. 30 Issue (11): 66-72    DOI: 10.11925/infotech.1003-3513.2014.11.10
A Method for Eliminating Noise in Text Classification Based on Category Distribution Characteristics
Li Xiangdong1,2, Ba Zhichao1, Huang Li3
1 School of Information Management, Wuhan University, Wuhan 430072, China;
2 Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China;
3 Wuhan University Library, Wuhan 430072, China
Abstract  

[Objective] To reduce the impact of noisy samples on classification performance when constructing training sets, this paper proposes a noise-processing method based on the distribution characteristics of category data in the training samples. [Methods] The method defines Category Cluster Density to represent the degree of similarity among the documents in one category, and then applies a normality transformation to the similarity distribution. Using approximate confidence interval estimation, it then identifies the document pairs that contain noisy samples. The recognition of noisy documents is verified with the relative entropy of the distribution and the Category Cluster Density. [Results] Classification performance on the specialized corpus and the self-built corpus improves by 1.21% to 4.83%. [Limitations] The sample set still needs to be enriched, and the method needs to be tested in more fields and in multi-type data environments. [Conclusions] The method is feasible and can effectively detect noisy documents. It can also test a large number of samples in a single batch.
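
The abstract gives no formulas, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes documents are non-negative term-weight vectors, uses mean pairwise cosine similarity as a stand-in for Category Cluster Density, and uses mean minus z standard deviations as a simplified lower bound in place of the paper's confidence interval estimate. All function names, the value of `z`, and the histogram binning are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, stdev

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pairwise_sims(vectors):
    """Return the index pairs and their cosine similarities."""
    pairs = list(combinations(range(len(vectors)), 2))
    return pairs, [cosine(vectors[i], vectors[j]) for i, j in pairs]

def cluster_density(vectors):
    """Assumed stand-in for Category Cluster Density: mean pairwise similarity."""
    return mean(pairwise_sims(vectors)[1])

def suspect_documents(vectors, z=1.0):
    """Flag the documents that occur most often in unusually dissimilar pairs.

    A pair is 'unusual' when its similarity falls below mean - z * stdev,
    a simplified lower bound (assumption, not the paper's estimator).
    """
    pairs, sims = pairwise_sims(vectors)
    lower = mean(sims) - z * stdev(sims)
    counts = [0] * len(vectors)
    for (i, j), s in zip(pairs, sims):
        if s < lower:
            counts[i] += 1
            counts[j] += 1
    top = max(counts)
    return [i for i, c in enumerate(counts) if c == top and c > 0]

def sim_histogram(sims, bins=5):
    """Normalized histogram of similarities over [0, 1]."""
    hist = [0] * bins
    for s in sims:
        hist[max(0, min(int(s * bins), bins - 1))] += 1
    return [h / len(sims) for h in hist]

def kl_divergence(p, q, eps=1e-12):
    """Relative entropy D(p || q); eps avoids division by zero."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)
```

Under these assumptions, a flagged document would be accepted as noise only if removing it raises `cluster_density` for the category and visibly shifts the similarity histogram, measured with `kl_divergence` between the histograms before and after removal. The threshold `z` needs tuning per corpus; with few documents, a large `z` can push the lower bound below every observed similarity so that nothing is flagged.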

Key words: Training samples; Distribution of similarity; Text classification; Cluster density; Noise processing
Received: 30 March 2014      Published: 18 December 2014
CLC Number: TP391

Cite this article:

Li Xiangdong, Ba Zhichao, Huang Li. A Method for Eliminating Noise in Text Classification Based on Category Distribution Characteristics. New Technology of Library and Information Service, 2014, 30(11): 66-72.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2014.11.10     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2014/V30/I11/66

[1] Bengong Yu, Yangnan Chen, Ying Yang. Classifying Short Text Complaints with nBD-SVM Model[J]. Data Analysis and Knowledge Discovery, 2019, 3(5): 77-85.
[2] Zixuan Zhang, Hao Wang, Liping Zhu, Sanhong Deng. Identifying Risks of HS Codes by China Customs[J]. Data Analysis and Knowledge Discovery, 2019, 3(1): 72-84.
[3] Xinlei Li, Hao Wang, Xiaomin Liu, Sanhong Deng. Comparing Text Vector Generators for Weibo Short Text Classification[J]. Data Analysis and Knowledge Discovery, 2018, 2(8): 41-50.
[4] Liu Liu, Dongbo Wang. Identifying Interdisciplinary Social Science Research Based on Article Classification[J]. Data Analysis and Knowledge Discovery, 2018, 2(3): 30-38.
[5] Xiangdong Li, Tao Ruan, Kang Liu. Automatic Classification of Documents from Wikipedia[J]. Data Analysis and Knowledge Discovery, 2017, 1(10): 43-52.
[6] Yonghe Lu, Jinghuang Chen. Optimizing Feature Selection Method for Text Classification with Shuffled Frog Leaping Algorithm[J]. Data Analysis and Knowledge Discovery, 2017, 1(1): 91-101.
[7] Qun Zhang, Hongjun Wang, Lunwen Wang. Classifying Short Texts with Word Embedding and LDA Model[J]. Data Analysis and Knowledge Discovery, 2016, 32(12): 27-35.
[8] Hu Juxiang, Lv Xueqiang, Liu Kehui. Complaint Text Classification Based on Guiding Words[J]. New Technology of Library and Information Service, 2015, 31(7-8): 97-103.
[9] Li Xiangdong, Ba Zhichao, Huang Li. Allocation and Multi-granularity[J]. New Technology of Library and Information Service, 2015, 31(5): 42-49.
[10] Lu Yonghe, Wang Hongbin. Feature Weighting Method Affected by Part of Speech in Text Classification[J]. New Technology of Library and Information Service, 2015, 31(4): 18-25.
[11] Li Xiangdong, Cao Huan, Ding Cong, Huang Li. Short-text Classification Based on HowNet and Domain Keyword Set Extension[J]. New Technology of Library and Information Service, 2015, 31(2): 31-38.
[12] Liu Huailiang, Du Kun, Qin Chunxiu. Research on Chinese Text Categorization Based on Semantic Similarity of HowNet[J]. New Technology of Library and Information Service, 2015, 31(2): 39-45.
[13] Du Kun, Liu Huailiang, Guo Lujie. Study on the Modified Method of Feature Weighting with Complex Networks[J]. New Technology of Library and Information Service, 2015, 31(11): 26-32.
[14] Shao Jian, Zhang Chengzhi. Automatic Acquisition of Domain Parallel Corpora from Internet[J]. New Technology of Library and Information Service, 2014, 30(12): 36-43.
[15] Hu Yongjun, Jiang Jiaxin, Chang Huiyou. A New Method of Keywords Extraction for Chinese Short-text Classification[J]. New Technology of Library and Information Service, 2013, (6): 42-48.