Abstract: In feature selection, a term's weight determines whether the term is kept as a feature. That weight is affected by many factors, chiefly the term's importance, how characteristic it is, and how representative it is. Taking these factors into account, this paper proposes a new function, TW (Term Weight), which is based on a feature's importance and its ability to distinguish categories, as an improved feature selection method. Experiments comparing the CHI (chi-square statistic), IG (information gain), and TW values of terms show that TW increases the weight of features specific to a class and decreases the weight of unimportant features. Finally, classification experiments with three classifiers on a Chinese text classification corpus support the validity of the new algorithm for feature selection.
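For readers unfamiliar with the two baselines named in the abstract, the following is a minimal sketch of the textbook chi-square and information gain scores computed from document counts. It is not the paper's implementation, and TW itself is not shown because this abstract does not give its formula. The function names and the count-based inputs (`a`, `b`, `c`, `d`, `with_term`, `without_term`) are assumptions made for illustration.

```python
import math


def chi_square(a, b, c, d):
    """Textbook chi-square score of a term for one class.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    (hypothetical parameter names, chosen for this sketch)
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or class never varies
    return n * (a * d - c * b) ** 2 / denom


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total) for k in counts if k > 0)


def information_gain(with_term, without_term):
    """Textbook information gain of a term over all classes.

    with_term[i]: documents of class i that contain the term
    without_term[i]: documents of class i that lack the term
    """
    class_totals = [w + wo for w, wo in zip(with_term, without_term)]
    n = sum(class_totals)
    if n == 0:
        return 0.0
    p_term = sum(with_term) / n
    return (entropy(class_totals)
            - p_term * entropy(with_term)
            - (1 - p_term) * entropy(without_term))


if __name__ == "__main__":
    # A term that appears in every document of class 0 and in none of class 1.
    print(chi_square(10, 0, 0, 10))
    print(information_gain([10, 0], [0, 10]))
```

Both scores are zero for a term that is spread evenly across classes and grow as the term concentrates in one class, which is the behavior the abstract compares TW against.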
路永和, 李焰锋. 多因素影响的特征选择方法[J]. 现代图书情报技术, 2013, (5): 34-39.
Lu Yonghe, Li Yanfeng. A Feature Selection Based on Consideration of Multiple Factors. New Technology of Library and Information Service, 2013, (5): 34-39.