Improved TF-IDF Method in Text Classification
Qin Shian, Li Fayun
School of Public Administration and Policy, Fuzhou University, Fuzhou 350108, China
Abstract: When one class contains far more documents than another, the IDF component of TF-IDF produces weights that run counter to its design intent. This paper addresses the problem by modifying the TF-IDF algorithm with a probability-based term. Experiments show that the modified method classifies webpage text well using a simple cumulative sum of the characteristic-word values, with faster classification and a higher accuracy rate.
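The imbalance problem described in the abstract can be illustrated with a minimal sketch. It assumes the classic IDF form log(N / n_t) and a hypothetical two-class corpus; the class sizes, the term counts, and the names `idf`, `class_sizes`, and `coverage_*` are illustrative assumptions, not values or formulas from the paper. The per-class coverage ratio is shown only as a contrast and is not claimed to be the authors' probability-based correction.

```python
import math


def idf(num_docs_total, num_docs_with_term):
    """Classic inverse document frequency: log(N / n_t)."""
    return math.log(num_docs_total / num_docs_with_term)


# Hypothetical imbalanced corpus: class A is nine times larger than class B.
class_sizes = {"A": 900, "B": 100}
total_docs = sum(class_sizes.values())

# Two hypothetical terms, each appearing in 90% of its own class
# and in no document of the other class.
docs_with_term_a = 810  # 90% of class A
docs_with_term_b = 90   # 90% of class B

idf_a = idf(total_docs, docs_with_term_a)
idf_b = idf(total_docs, docs_with_term_b)

# Per-class coverage: the share of a class's documents containing the term.
coverage_a = docs_with_term_a / class_sizes["A"]
coverage_b = docs_with_term_b / class_sizes["B"]

print(f"IDF of class-A term: {idf_a:.3f}, coverage in A: {coverage_a:.2f}")
print(f"IDF of class-B term: {idf_b:.3f}, coverage in B: {coverage_b:.2f}")
```

Both terms characterize their own class equally well under these assumptions, yet the term from the larger class receives a much smaller IDF simply because that class has more documents.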
Received: 17 June 2013
Published: 04 November 2013