New Technology of Library and Information Service, 2015, Vol. 31, Issue 3: 39-48     https://doi.org/10.11925/infotech.1003-3513.2015.03.06
Research Paper
An Improved TF-IDF Feature Selection Based on Categorical Description
Xu Dongdong, Wu Shaobo
School of Information and Communication Engineering, Beijing Information Science and Technology University, Beijing 100101, China

Abstract

[Objective] To improve text categorization accuracy by modifying the feature weighting formula used in feature selection. [Methods] By introducing intra-class and inter-class information and modifying the TF-IDF weighting factors, this paper proposes TF-IDF-CD, an approach based on categorical description. TF-IDF-CD is combined with several classifiers, such as NB, KNN, and SVM, in text categorization experiments on both a balanced corpus and an unbalanced (skewed) corpus, and its classification accuracy is compared with that of TF-IDF, CTD, and other weighting approaches. [Results] TF-IDF-CD performs well even with a small number of feature items. Compared with TF-IDF, across the different corpora and classifiers, TF-IDF-CD substantially improves the average classification accuracy: the smallest increase is 14% and the largest is 30%. Compared with CTD, TF-IDF-CD combined with NB, SVM, or DT improves the average accuracy by 1% to 13%. On the unbalanced corpus, however, TF-IDF-CD combined with KNN performs 2% lower than CTD combined with KNN. [Limitations] When combined with KNN, a classifier that is sensitive to corpus imbalance, TF-IDF-CD still needs better resistance to data skew. [Conclusions] The experimental results show that the TF-IDF-CD feature selection approach is effective and offers a useful reference for improving TF-IDF.
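The abstract does not give the TF-IDF-CD formula, so the following is only a minimal Python sketch of the general idea it describes: a standard TF-IDF weight scaled by a factor derived from how a term is distributed across classes. The baseline uses the common tf * log(N / df) form. The function `class_concentration` and its definition (the largest share of a term's document frequency that falls in a single class) are hypothetical stand-ins chosen for illustration, not the authors' TF-IDF-CD factors, and the toy corpus is invented.

```python
import math
from collections import Counter, defaultdict


def tf_idf(docs):
    """Baseline TF-IDF: (count / doc length) * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


def class_concentration(docs, labels):
    """Hypothetical class-aware factor (NOT the paper's formula).

    For each term, returns the largest fraction of its document
    frequency that falls inside one class; 1.0 means the term
    occurs in documents of a single class only.
    """
    per_class = defaultdict(Counter)
    df = Counter()
    for doc, label in zip(docs, labels):
        for term in set(doc):
            per_class[label][term] += 1
            df[term] += 1
    return {t: max(per_class[c][t] for c in per_class) / df[t] for t in df}


def class_aware_weights(docs, labels):
    """Scale each baseline TF-IDF weight by the class-concentration factor."""
    base = tf_idf(docs)
    conc = class_concentration(docs, labels)
    return [{t: w * conc[t] for t, w in doc_w.items()} for doc_w in base]


# Toy, invented corpus: two classes with two tokenized documents each.
docs = [
    ["goal", "match", "team"],
    ["goal", "team", "coach"],
    ["stock", "market", "team"],
    ["stock", "bank", "market"],
]
labels = ["sport", "sport", "finance", "finance"]

conc = class_concentration(docs, labels)
weights = class_aware_weights(docs, labels)
```

In this toy corpus, "goal" appears only in the "sport" documents and keeps its full baseline weight, while "team" appears in both classes and is scaled down. A term selection step could then rank terms by such weights and keep the top-ranked ones as features.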

Keywords: Text categorization; Feature selection; TF-IDF; Categorical description
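Received: 2014-08-23      Published: 2015-04-16
CLC number: TP391
Funding:

This work is one of the research outcomes of the Beijing Municipal Education Commission Science and Technology Development Plan project "Research on Key Technologies of Dynamic Cloud Security for the Mobile Internet under the Cloud Computing Model" (No. KM201311232010), the National Natural Science Foundation of China project "Research on End-to-End Energy Efficiency Management Strategies for Wireless Networks Based on Resource Label Switching" (No. 61271198), and the National Natural Science Foundation of China project "Research on Dynamic Resource Allocation and Performance Evaluation of LTE-A Femtocell Systems" (No. 61370065).

Corresponding author: Xu Dongdong, ORCID: 0000-0001-6168-1514, E-mail: dongdongxu@foxmail.com
Author contributions: Xu Dongdong designed the research plan, conducted the experiments, and wrote the paper; Wu Shaobo proposed the research idea, designed the paper framework, and revised the paper.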
Cite this article:
Xu Dongdong, Wu Shaobo. An Improved TF-IDF Feature Selection Based on Categorical Description. New Technology of Library and Information Service, 2015, 31(3): 39-48.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2015.03.06      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2015/V31/I3/39

