New Technology of Library and Information Service  2013, Vol. 29 Issue (10): 27-30    DOI: 10.11925/infotech.1003-3513.2013.10.05
Improved TF-IDF Method in Text Classification
Qin Shian, Li Fayun
School of Public Administration and Policy, Fuzhou University, Fuzhou 350108, China
Abstract  When one class contains far more documents than another, the IDF component of TF-IDF behaves contrary to its design intent. This paper addresses the problem by modifying the TF-IDF algorithm with probabilities. Experiments show that the modified method classifies webpage text well using a simple cumulative sum of the feature-word values, with faster classification and a higher accuracy rate.
Key words: Probability; TF-IDF; Webpage; Text classification
Received: 17 June 2013      Published: 04 November 2013
CLC Number: TP391
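
The abstract does not give the authors' modified formula, so the following is only a minimal sketch of the problem it describes and of one possible probability-style alternative. The function names, the log-ratio weight, the smoothing constant `eps`, and the toy corpus are all assumptions made for illustration, not the paper's method.

```python
import math
from collections import Counter

def idf(term, docs):
    """Classic IDF: log(N / df), where df counts documents containing the term."""
    df = sum(1 for words, _ in docs if term in words)
    return math.log(len(docs) / df) if df else 0.0

def class_prob_weight(term, label, docs, eps=1e-6):
    """Hypothetical weight (an assumption, not the paper's formula):
    log ratio of the term's in-class to out-of-class document frequency."""
    in_cls = [words for words, lab in docs if lab == label]
    out_cls = [words for words, lab in docs if lab != label]
    if not in_cls or not out_cls:
        return 0.0
    p_in = sum(term in w for w in in_cls) / len(in_cls)
    p_out = sum(term in w for w in out_cls) / len(out_cls)
    return math.log((p_in + eps) / (p_out + eps))

def classify(words, labels, docs):
    """Cumulatively sum the per-class weights of a document's words
    (negative weights clipped to 0) and pick the class with the largest sum."""
    tf = Counter(words)
    scores = {
        lab: sum(n * max(class_prob_weight(t, lab, docs), 0.0)
                 for t, n in tf.items())
        for lab in labels
    }
    return max(scores, key=scores.get), scores

# Toy imbalanced corpus: 9 "sports" documents, 1 "finance" document.
docs = [({"match", "news"}, "sports")] * 9 + [({"stock", "news"}, "finance")]

print(idf("match", docs), idf("stock", docs))
print(classify(["match", "news"], ["sports", "finance"], docs)[0])
```

In this toy corpus, "match" appears in every document of the large class, so its classic IDF is log(10/9), about 0.105, while "stock" gets log(10), about 2.303, simply because its class is small. The per-class probability ratio does not depend on class size in this way, and the cumulative sum assigns the sample document to "sports".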

Cite this article:

Qin Shian, Li Fayun. Improved TF-IDF Method in Text Classification. New Technology of Library and Information Service, 2013, 29(10): 27-30.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2013.10.05     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2013/V29/I10/27
