Feature Weighting Method Affected by Part of Speech in Text Classification
Lu Yonghe, Wang Hongbin
School of Information Management, Sun Yat-Sen University, Guangzhou 510006, China

Abstract [Objective] This paper improves the feature weighting method by introducing the effect of part of speech, with the aim of achieving higher precision in text classification. [Methods] Feature weighting that incorporates part of speech is compared with classical TF-IDF on a text classification task. In the part-of-speech approach, a weight for each part of speech enters the feature weight calculation, and Particle Swarm Optimization is used to search for the best part-of-speech weights. All parallel tests use an SVM classifier. [Results] The improved feature weighting method outperforms classical TF-IDF: classification precision improves clearly across feature spaces of different dimensions, with gains between 2% and 6%. [Limitations] Because of limited experimental conditions, the weights determined in the experiment only approximate the best weights; a larger data set and more iterations are needed to obtain better ones. [Conclusions] Introducing part of speech into text classification yields higher precision. In decreasing order of influence, the parts of speech are nouns, verbs, and strings. The modified feature weighting method applies not only to a particular corpus but also to general ones.
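
The idea of scaling a TF-IDF weight by a part-of-speech factor can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so the combination rule used here (tf × idf × part-of-speech weight), the function name `pos_weighted_tfidf`, the tag names, and all numeric weights are assumptions for illustration only. The Particle Swarm Optimization search for the weights and the SVM classifier are not shown.

```python
import math
from collections import Counter

# Hypothetical part-of-speech weights. The paper tunes such weights with
# Particle Swarm Optimization; these values are placeholders, not results.
POS_WEIGHTS = {"noun": 1.0, "verb": 0.6, "string": 0.3}
DEFAULT_POS_WEIGHT = 0.1  # assumed weight for any other tag


def pos_weighted_tfidf(docs, pos_weights=POS_WEIGHTS, default=DEFAULT_POS_WEIGHT):
    """Return one {term: weight} dict per document.

    docs is a list of documents, each a list of (term, pos_tag) pairs.
    Assumed rule: weight = tf * idf * pos_weight, with tf the raw count
    and idf = log(N / df).
    """
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for doc in docs:
        df.update({term for term, _ in doc})

    weighted = []
    for doc in docs:
        tf = Counter(term for term, _ in doc)
        # If a term appears with several tags, the last tag seen is used.
        pos_of = {term: pos for term, pos in doc}
        weights = {}
        for term, count in tf.items():
            idf = math.log(n_docs / df[term])
            weights[term] = count * idf * pos_weights.get(pos_of[term], default)
        weighted.append(weights)
    return weighted


if __name__ == "__main__":
    sample = [
        [("market", "noun"), ("rise", "verb")],
        [("market", "noun"), ("fall", "verb"), ("abc", "string")],
    ]
    for doc_weights in pos_weighted_tfidf(sample):
        print(doc_weights)
```

In this sketch a term that occurs in every document receives weight zero, as in classical TF-IDF with an unsmoothed idf; the part-of-speech factor only rescales terms that already carry some discriminative weight.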
Received: 29 September 2014
Published: 21 May 2015