New Technology of Library and Information Service, 2013, Vol. 29, Issue 10: 27-30. DOI: 10.11925/infotech.1003-3513.2013.10.05
Section: Knowledge Organization and Knowledge Management
Improved TF-IDF Method in Text Classification
Qin Shian, Li Fayun
School of Public Administration and Policy, Fuzhou University, Fuzhou 350108, China
Abstract: TF-IDF extracts features poorly when documents are unevenly distributed across the classes to be classified: if one class holds far more documents than another, the IDF term behaves contrary to its design intent. This paper improves the TF-IDF algorithm by replacing the ratio of a feature's occurrence counts across classes with the ratio of its occurrence probabilities across classes. Experiments show that extracting webpage text features with the improved method, combined with a classifier that simply sums the feature values, clearly improves the accuracy of webpage text classification and speeds it up.
Key words: Probability; TF-IDF; Webpage; Text classification
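The abstract does not state the revised formula, so the following Python sketch only illustrates the general idea under assumed definitions; it is not the authors' implementation. The function names (`doc_freq`, `class_weights`, `classify`), the add-one smoothing, the `log(1 + ratio)` form, and the toy corpus are all hypothetical. It contrasts a class weight built from raw document counts with one built from within-class occurrence probabilities, and scores a document by summing the weights of its tokens.

```python
import math
from collections import Counter


def doc_freq(docs):
    """Count, for each term, how many documents in `docs` contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


def class_weights(corpus, use_probability=True):
    """Return {label: {term: weight}} for a {label: [tokenized docs]} corpus.

    Assumed weight: log(1 + inside / outside), where inside/outside are
    either document counts or occurrence probabilities (counts divided by
    the number of documents). The smoothing is an illustrative choice.
    """
    df = {label: doc_freq(docs) for label, docs in corpus.items()}
    size = {label: len(docs) for label, docs in corpus.items()}
    vocab = set().union(*df.values())
    weights = {}
    for label in corpus:
        others = [other for other in corpus if other != label]
        n_other_docs = sum(size[other] for other in others)
        weights[label] = {}
        for term in vocab:
            n_in = df[label][term]
            n_out = sum(df[other][term] for other in others)
            if use_probability:
                inside = n_in / size[label]
                outside = (n_out + 1) / (n_other_docs + 1)
            else:
                inside = n_in
                outside = n_out + 1
            weights[label][term] = math.log(1 + inside / outside)
    return weights


def classify(doc, weights):
    """Sum the weight of every token occurrence and pick the best class."""
    scores = {
        label: sum(w.get(token, 0.0) for token in doc)
        for label, w in weights.items()
    }
    return max(scores, key=scores.get)


# Toy imbalanced corpus: 2 "sports" documents versus 100 "news" documents.
corpus = {
    "sports": [["goal", "match"], ["goal", "team"]],
    "news": [["goal", "report"]] * 5 + [["report", "market"]] * 95,
}

print(classify(["goal"], class_weights(corpus, use_probability=False)))  # news
print(classify(["goal"], class_weights(corpus, use_probability=True)))   # sports
```

In this toy corpus "goal" appears in every sports document but in only 5% of news documents. The raw counts (2 versus 5) still favor the larger class, whereas the probability ratio favors the smaller one, which is the kind of imbalance effect the abstract describes.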
Received: 2013-06-17
Chinese Library Classification (CLC) number: TP391
Cite this article:
Qin Shian, Li Fayun. Improved TF-IDF Method in Text Classification[J]. New Technology of Library and Information Service, 2013, 29(10): 27-30. DOI: 10.11925/infotech.1003-3513.2013.10.05.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2013.10.05