New Technology of Library and Information Service  2015, Vol. 31 Issue (3): 39-48    DOI: 10.11925/infotech.1003-3513.2015.03.06
An Improved TF-IDF Feature Selection Based on Categorical Description
Xu Dongdong, Wu Shaobo
School of Information and Communication Engineering, Beijing Information Science and Technology University, Beijing 100101, China
Abstract  

[Objective] This paper aims to improve text categorization accuracy by modifying the weighting approach used in feature selection. [Methods] By introducing inner-category and outer-category information and modifying the TF-IDF weighting, this paper proposes the TF-IDF-CD approach, which is based on categorical description. TF-IDF-CD is combined with various classifiers, such as NB and SVM, in text categorization experiments on a balanced corpus and an unbalanced corpus. Finally, the accuracies of other weighting approaches are compared with that of TF-IDF-CD. [Results] TF-IDF-CD performs well even with a small number of feature items. Compared with TF-IDF, when combined with various classifiers on different corpora, TF-IDF-CD greatly improves the average accuracy: the minimum increase is 14% and the maximum is 30%. Compared with the CTD approach, when combined with NB, SVM, and DT, TF-IDF-CD improves the average accuracy of text categorization by 1% to 13%. However, on the unbalanced corpus, when combined with KNN, the accuracy of TF-IDF-CD is 2% lower than that of CTD. [Limitations] When combined with the KNN classifier, which is sensitive to skewed data, TF-IDF-CD needs to be improved to resist the skew of an unbalanced corpus. [Conclusions] Experimental results show that the TF-IDF-CD approach is effective.
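For orientation, the classical TF-IDF weighting that the abstract uses as its baseline can be sketched as follows. This is a minimal illustration of one common TF-IDF variant (relative term frequency multiplied by the natural log of the inverse document frequency); it is not the TF-IDF-CD method, whose category terms the abstract does not define. The function name `tf_idf` and the toy documents are assumptions made for illustration only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Hypothetical helper: weight = (count / doc length) * ln(N / df),
    one common TF-IDF variant among several in the literature.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus (illustrative data, not from the paper).
docs = [
    ["stock", "market", "rises"],
    ["stock", "market", "falls"],
    ["team", "wins", "match"],
]
weights = tf_idf(docs)
```

In this toy corpus, a term that occurs in only one document ("rises") receives a higher weight than a term shared by two documents ("stock"). Note that the weight ignores class labels entirely, which is the kind of gap that category-aware weighting schemes are meant to address.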

Key words: Text categorization, Feature selection, TF-IDF, Categorical description
Received: 23 August 2014      Published: 16 April 2015
CLC Number: TP391

Cite this article:

Xu Dongdong, Wu Shaobo. An Improved TF-IDF Feature Selection Based on Categorical Description. New Technology of Library and Information Service, 2015, 31(3): 39-48.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2015.03.06     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2015/V31/I3/39
