Please wait a minute...
Advanced Search
现代图书情报技术  2013, Vol. 29 Issue (10): 27-30     https://doi.org/10.11925/infotech.1003-3513.2013.10.05
  知识组织与知识管理 本期目录 | 过刊浏览 | 高级检索 |
文本分类中TF-IDF方法的改进研究
覃世安, 李法运
福州大学公共管理学院 福州 350108
Improved TF-IDF Method in Text Classification
Qin Shian, Li Fayun
School of Public Administration and Policy, Fuzhou University, Fuzhou 350108, China
全文: PDF (322 KB)   HTML  
输出: BibTeX | EndNote (RIS)      
摘要 针对TF-IDF在待分类文本类的数量分布不均时提取特征值效果差的问题,提出使用特征值在类间出现的概率比代替特征值在类间出现的次数比以改进TF-IDF算法。实验证明利用改进后的TF-IDF方法提取网页文本特征值,并配合简单累加求和的分类器,使得网页文本分类的准确率有明显提高,且分类速度加快。
服务
把本文推荐给朋友
加入引用管理器
E-mail Alert
RSS
作者相关文章
覃世安
李法运
关键词 概率TF-IDF网页文本分类    
Abstract:When the count of one class is much more than another class's, the result of IDF in TF-IDF goes the wrong way according to its design idea. This paper solves the problem by using probability to change TF-IDF algorithm. In the end, the experiment proves that the solution mentioned above is good at classifying webpage text through a simple way to cumulative sum the value of characteristic words and the speed is faster and the accuracy rate is promoted.
Key wordsProbability    TF-IDF    Webpage    Text classification
收稿日期: 2013-06-17      出版日期: 2013-11-04
:  TP391  
引用本文:   
覃世安, 李法运. 文本分类中TF-IDF方法的改进研究[J]. 现代图书情报技术, 2013, 29(10): 27-30.
Qin Shian, Li Fayun. Improved TF-IDF Method in Text Classification. New Technology of Library and Information Service, 2013, 29(10): 27-30.
链接本文:  
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2013.10.05      或      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2013/V29/I10/27
[1] Sebastiani F. Machine Learning in Automated Text Categorization[J]. ACM Computing Surveys (CSUR), 2002,34(1):1-47.
[2] 鲁松,李晓黎,白硕. 文档中词语权重计算方法的改进[J]. 中文信息学报, 2000,14(6): 8-13.(Lu Song,Li Xiaoli,Bai Shuo.An Improved Approach to Weighting Terms in Text[J].Journal of Chinese Information Processing, 2000,14(6): 8-13.)
[3] 罗欣, 夏德麟, 晏蒲柳. 基于词频差异的特征选取及改进的TF-IDF公式[J]. 计算机应用, 2005, 25(9): 2031-2033. (Luo Xin, Xia Delin,Yan Puliu. Improved Feature Selection Method and TF-IDF Formula Based on Word Frequency Differentia[J]. Journal of Computer Applications, 2005, 25(9): 2031-2033.)
[4] 张保富, 施化吉, 马素琴. 基于 TFIDF文本特征加权方法的改进研究[J]. 计算机应用与软件, 2011, 28(2):17-20.(Zhang Baofu,Shi Huaji,Ma Suqin. An Improved Text Feature Weighting Algorithm Based on TFIDF[J]. Computer Applications and Software, 2011,28(2): 17-20.)
[5] Forman G. BNS Feature Scaling: An Improved Representation over tf-idf for SVM Text Classification[C].In: Proceedings of the 17th ACM Conference on Information and Knowledge Management. ACM, 2008: 263-270.
[6] Lan M, Tan C L, Low H B, et al. A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with Support Vector Machines[C].In: Special Interest Tracks and Posters of the 14th International Conference on World Wide Web. New York, NY, USA: ACM, 2005: 1032-1033.
[7] Oren N. Reexamining tf. idf Based Information Retrieval with Genetic Programming[C].In: Proceedings of the 2002 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists on Enablement Through Technology. Republic of South Africa: South African Institute for Computer Scientists and Information Technologists, 2002: 224-234.
[8] Aizawa A. An Information-theoretic Perspective of tf-idf Measures[J].Information Processing and Management,2003,39(1):45-65.
[9] 梁之舜, 邓集贤, 杨维权,等.概率论及数理统计[M].北京:高等教育出版社,1988.(Liang Zhishun, Deng Jixian, Yang Weiquan, et al. Probability Theory and Mathematical Statistics[M].Beijing:Higher Education Press,1988.)
[10] 苏金树,张博锋,徐昕.基于机器学习的文本分类技术研究进展[J]. 软件学报,2006,17(9): 1848-1859.(Su Jinshu,Zhang Bofeng,Xu Xin. Advances in Machine Learning Based Text Categorization[J].Journal of Software,2006,17(9):1848-1859.)
[11] Zhang H P, Yu H K, Xiong D Y, et al. HHMM-based Chinese Lexical Analyzer ICTCLAS[C].In: Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics,2003,17:184-187.
[12] 张玉芳,彭时名,吕佳. 基于文本分类TFIDF方法的改进与应用[J]. 计算机工程,2006,32(19): 76-78.(Zhang Yufang,Peng Shiming,Lv Jia. Improvement and Application of TFIDF Method Based on Text Classification[J].Computer Engineering,2006,32(19):76-78.)
[1] 陈杰,马静,李晓峰. 融合预训练模型文本特征的短文本分类方法*[J]. 数据分析与知识发现, 2021, 5(9): 21-30.
[2] 周泽聿,王昊,赵梓博,李跃艳,张小琴. 融合关联信息的GCN文本分类模型构建及其应用研究*[J]. 数据分析与知识发现, 2021, 5(9): 31-41.
[3] 余本功,朱晓洁,张子薇. 基于多层次特征提取的胶囊网络文本分类研究*[J]. 数据分析与知识发现, 2021, 5(6): 93-102.
[4] 王艳, 王胡燕, 余本功. 基于多特征融合的中文文本分类研究*[J]. 数据分析与知识发现, 2021, 5(10): 1-14.
[5] 唐晓波,高和璇. 基于关键词词向量特征扩展的健康问句分类研究 *[J]. 数据分析与知识发现, 2020, 4(7): 66-75.
[6] 王思迪,胡广伟,杨巳煜,施云. 基于文本分类的政府网站信箱自动转递方法研究*[J]. 数据分析与知识发现, 2020, 4(6): 51-59.
[7] 徐月梅,刘韫文,蔡连侨. 基于深度融合特征的政务微博转发规模预测模型*[J]. 数据分析与知识发现, 2020, 4(2/3): 18-28.
[8] 彭郴,吕学强,孙宁,张乐,姜肇财,宋黎. 基于CNN的消费品缺陷领域词典构建方法研究*[J]. 数据分析与知识发现, 2020, 4(11): 112-120.
[9] 徐彤彤,孙华志,马春梅,姜丽芬,刘逸琛. 基于双向长效注意力特征表达的少样本文本分类模型研究*[J]. 数据分析与知识发现, 2020, 4(10): 113-123.
[10] 余本功,曹雨蒙,陈杨楠,杨颖. 基于nLD-SVM-RF的短文本分类研究*[J]. 数据分析与知识发现, 2020, 4(1): 111-120.
[11] 凌洪飞,欧石燕. 面向主题模型的主题自动语义标注研究综述 *[J]. 数据分析与知识发现, 2019, 3(9): 16-26.
[12] 聂维民,陈永洲,马静. 融合多粒度信息的文本向量表示模型 *[J]. 数据分析与知识发现, 2019, 3(9): 45-52.
[13] 邵云飞,刘东苏. 基于类别特征扩展的短文本分类方法研究 *[J]. 数据分析与知识发现, 2019, 3(9): 60-67.
[14] 秦贺然,刘浏,李斌,王东波. 融入实体特征的典籍自动分类研究 *[J]. 数据分析与知识发现, 2019, 3(9): 68-76.
[15] 陈果,许天祥. 基于主动学习的科技论文句子功能识别研究 *[J]. 数据分析与知识发现, 2019, 3(8): 53-61.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
版权所有 © 2015 《数据分析与知识发现》编辑部
地址:北京市海淀区中关村北四环西路33号 邮编:100190
电话/传真:(010)82626611-6626,82624938
E-mail:jishu@mail.las.ac.cn