New Technology of Library and Information Service  2013, Vol. Issue (5): 34-39    DOI: 10.11925/infotech.1003-3513.2013.05.04
A Feature Selection Based on Consideration of Multiple Factors
Lu Yonghe, Li Yanfeng
School of Information Management, Sun Yat-Sen University, Guangzhou 510006, China
Abstract  In feature selection, a term's weight determines whether the term is kept as a feature. That weight is affected by many factors, chiefly the term's importance, its characteristics, and its representativeness. Taking these factors into account, this paper proposes a new function, TW (Term Weight), based on the importance of a feature and its ability to distinguish between categories, as an improved feature selection method. Experiments comparing the CHI, IG, and TW scores of terms show that TW raises the weight of features that are specific to a class and lowers the weight of unimportant features. Finally, classification experiments on a Chinese text classification corpus with three classifiers support the validity of the new algorithm for feature selection.
Key words: Text categorization; Feature selection; Class discrimination; TF-IDF
Received: 16 April 2013      Published: 03 July 2013
CLC Number: TP391
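
The abstract compares TW against two baseline scores, CHI and IG, but does not define TW itself, so only the baselines are sketched here. The code below is a minimal illustration that assumes the commonly used document-count forms of the chi-square statistic and of information gain for one term and one category (category versus all other categories). The function names and the count variables `a`, `b`, `c`, `d` are illustrative choices, not taken from the paper.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square score of a term for one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    (These names are assumptions made for this sketch.)
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table: the term or the category never varies.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total) for k in counts if k > 0)


def information_gain(a, b, c, d):
    """Information gain of a term for a category-versus-rest split.

    Computed as H(class) minus the entropy of the class after
    observing whether the term is present or absent.
    """
    n = a + b + c + d
    if n == 0:
        return 0.0
    h_class = entropy([a + c, b + d])
    h_with_term = entropy([a, b])
    h_without_term = entropy([c, d])
    return (h_class
            - ((a + b) / n) * h_with_term
            - ((c + d) / n) * h_without_term)


if __name__ == "__main__":
    # A term that appears in every document of the category and nowhere else.
    print(chi_square(10, 0, 0, 10), information_gain(10, 0, 0, 10))
    # A term spread evenly across the category and the rest of the corpus.
    print(chi_square(5, 5, 5, 5), information_gain(5, 5, 5, 5))
```

Both scores are highest for a term concentrated in a single category and fall to zero for a term spread evenly across categories. A multi-class information gain would sum over all categories rather than use the two-way split assumed here.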

Cite this article:

Lu Yonghe, Li Yanfeng. A Feature Selection Based on Consideration of Multiple Factors. New Technology of Library and Information Service, 2013, (5): 34-39.
