New Technology of Library and Information Service  2015, Vol. 31 Issue (4): 18-25    DOI: 10.11925/infotech.1003-3513.2015.04.03
Feature Weighting Method Affected by Part of Speech in Text Classification
Lu Yonghe, Wang Hongbin
School of Information Management, Sun Yat-Sen University, Guangzhou 510006, China
Abstract  

[Objective] This paper improves the feature weighting method by introducing the effect of part of speech, in order to achieve higher classification precision. [Methods] Text classification with part-of-speech-aware feature weighting is compared against the classical TF-IDF method. In the proposed approach, part-of-speech weights enter the feature weight calculation, and Particle Swarm Optimization is used to search for the best part-of-speech weights. All comparative tests use an SVM classifier. [Results] The experiments show that the improved feature weighting method performs better than classical TF-IDF: classification precision improves across feature spaces of different dimensions, with gains between 2% and 6%. [Limitations] Because of limited experimental conditions, the weights determined in the experiment only approximate the best weights; a larger data set and more iterations are needed to obtain better ones. [Conclusions] Introducing part of speech into text classification yields higher precision. Ranked by degree of influence in decreasing order, the parts of speech are nouns, verbs, and strings. The modified feature weighting method applies not only to a particular corpus but also to general corpora.
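
The abstract does not give the exact weighting formula, so the following is only a minimal sketch of one plausible reading: each term's TF-IDF value is multiplied by a weight attached to its part of speech. The function name `pos_weighted_tfidf`, the tag names, and the numeric weights in `POS_WEIGHTS` are all hypothetical; in the paper the weights are found by Particle Swarm Optimization rather than fixed by hand.

```python
import math
from collections import Counter

# Hypothetical part-of-speech weights (illustrative values only).
# The paper searches for such weights with Particle Swarm Optimization.
POS_WEIGHTS = {"noun": 1.0, "verb": 0.6, "string": 0.3}


def pos_weighted_tfidf(docs, pos_weights=POS_WEIGHTS):
    """Compute TF-IDF scaled by a per-part-of-speech weight.

    docs: list of documents, each a list of (term, pos_tag) pairs.
    Returns one {term: weight} dict per document.
    Assumed formula: weight = tf * log(N / df) * pos_weight.
    """
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for doc in docs:
        df.update({term for term, _ in doc})

    vectors = []
    for doc in docs:
        tf = Counter(term for term, _ in doc)
        # If a term appears with several tags, the last tag seen is used.
        pos_of = {term: pos for term, pos in doc}
        vec = {}
        for term, count in tf.items():
            idf = math.log(n_docs / df[term])
            # Tags without a configured weight are left unscaled (1.0).
            weight = pos_weights.get(pos_of[term], 1.0)
            vec[term] = count * idf * weight
        vectors.append(vec)
    return vectors


docs = [
    [("market", "noun"), ("rise", "verb")],
    [("market", "noun"), ("fall", "verb")],
]
vectors = pos_weighted_tfidf(docs)
```

In this toy input, "market" occurs in every document, so its IDF and therefore its weight is zero, while "rise" and "fall" keep a nonzero weight scaled by the verb factor. The resulting vectors could then be fed to any classifier, such as the SVM used in the paper's experiments.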

Key words: Text classification; Part of speech; Feature weighting; Particle swarm optimization
Received: 29 September 2014      Published: 21 May 2015
CLC Number: TP391

Cite this article:

Lu Yonghe, Wang Hongbin. Feature Weighting Method Affected by Part of Speech in Text Classification. New Technology of Library and Information Service, 2015, 31(4): 18-25.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2015.04.03     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2015/V31/I4/18

