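A Method for Eliminating Noise in Text Classification Based on Category Distribution Characteristics (文本分类中基于类别数据分布特性的噪声处理方法)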

doi:10.11925/infotech.1003-3513.2014.11.10

New Technology of Library and Information Service (现代图书情报技术)

2014, Vol. 30, Issue 11: 66-72

Information Analysis and Research


Li Xiangdong (李湘东)^1,2, Ba Zhichao (巴志超)^1, Huang Li (黄莉)^3

1 School of Information Management, Wuhan University, Wuhan 430072, China;
2 Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China;
3 Wuhan University Library, Wuhan 430072, China


Abstract

[Objective] To reduce the impact of noisy samples on classification performance when training samples are built from a corpus, this paper proposes a noise-processing method for text classification based on the distribution characteristics of category data in the training samples. [Methods] The method defines a cluster density for each category in the training samples to represent how similar the documents in that category are, and normalizes the distribution of document-pair similarities toward a normal distribution. It then combines approximate confidence-interval estimation with statistics to obtain the document pairs that contain noisy samples, and verifies that the noisy samples were identified correctly using the relative entropy of the distributions and the category cluster density. [Results] In tests on public and self-built corpora, classification performance improves by 1.21% to 4.83% on average compared with the same data before noise processing. [Limitations] The samples need to be enriched, and the method should be tested more comprehensively across multiple domains and data types. [Conclusions] The experiments indicate that the method is effective and feasible: it detects noisy samples in the training data and handles them in a single batch, with no need to judge each noisy sample individually before detection.
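The abstract names the building blocks of the method but not its formulas, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes cosine similarity between document vectors, takes cluster density to be the mean pairwise similarity within a category, uses a one-sided normal-approximation bound (`mean - z * std`) in place of the paper's confidence-interval estimate, and flags a document when it appears in a large share of the low-similarity pairs. The names `pairwise_similarities`, `cluster_density`, `suspect_documents`, `relative_entropy` and the parameters `z`, `min_share`, `bins`, `eps` are all hypothetical.

```python
import numpy as np


def pairwise_similarities(X):
    """Return index arrays (i, j) and cosine similarities for all distinct row pairs of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.where(norms == 0, 1.0, norms)  # avoid division by zero for empty docs
    S = unit @ unit.T
    i, j = np.triu_indices(X.shape[0], k=1)
    return i, j, np.clip(S[i, j], 0.0, 1.0)


def cluster_density(X):
    """Assumed definition: mean cosine similarity over all document pairs in one category."""
    _, _, sims = pairwise_similarities(X)
    return float(sims.mean())


def suspect_documents(X, z=1.645, min_share=0.5):
    """Flag documents that take part in an unusually large share of low-similarity pairs.

    A pair counts as "low" when its similarity falls below mean - z * std
    (z and min_share are illustrative values, not taken from the paper).
    """
    n = X.shape[0]
    i, j, sims = pairwise_similarities(X)
    lower = sims.mean() - z * sims.std()
    low = sims < lower
    counts = np.bincount(np.concatenate([i[low], j[low]]), minlength=n)
    share = counts / (n - 1)  # every document takes part in n - 1 pairs
    return np.flatnonzero(share >= min_share)


def relative_entropy(a, b, bins=10, eps=1e-9):
    """KL divergence D(P||Q) between smoothed histograms of two similarity samples on [0, 1]."""
    p, _ = np.histogram(a, bins=bins, range=(0.0, 1.0))
    q, _ = np.histogram(b, bins=bins, range=(0.0, 1.0))
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))


# Toy category: nine near-identical documents plus one unrelated document (index 9).
clean = np.array([[1.0, 1.0 + 0.01 * k, 0.0] for k in range(9)])
X = np.vstack([clean, [[0.0, 0.0, 1.0]]])

flagged = suspect_documents(X)
kept = np.delete(X, flagged, axis=0)

# Verification step: removing the flagged documents should raise the cluster
# density and shift the similarity distribution (non-zero relative entropy).
sims_before = pairwise_similarities(X)[2]
sims_after = pairwise_similarities(kept)[2]
print(flagged, cluster_density(X), cluster_density(kept))
print(relative_entropy(sims_before, sims_after))
```

In this toy setup the whole category is screened in one pass, which mirrors the batch detection the abstract describes. With real data the rows of `X` would be term-weight vectors for one category, and both the threshold and the flagging rule would need tuning.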

Keywords: training samples, similarity distribution, text classification, cluster density, noise processing

Received: 2014-03-30 Published: 2014-12-18

TP391

Corresponding author: Huang Li (黄莉), E-mail: huangcomplete@gmail.com

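Author contributions: Li Xiangdong: proposed the research question and approach, and finalized the paper; Ba Zhichao: collected and analyzed the data, carried out the experiments, and drafted and wrote the paper; Huang Li: designed the research plan and analyzed the experimental results.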

Cite this article:

Li Xiangdong, Ba Zhichao, Huang Li. A Method for Eliminating Noise in Text Classification Based on Category Distribution Characteristics [J]. New Technology of Library and Information Service, 2014, 30(11): 66-72.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2014.11.10 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2014/V30/I11/66


