New Technology of Library and Information Service, 2015, Vol. 31, Issue 4: 18-25. https://doi.org/10.11925/infotech.1003-3513.2015.04.03
Research Paper
Feature Weighting Method Affected by Part of Speech in Text Classification
Lu Yonghe, Wang Hongbin
School of Information Management, Sun Yat-Sen University, Guangzhou 510006, China
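Abstract (translated from Chinese)

[Objective] To improve classification accuracy, this paper introduces part of speech into the feature weighting method, so that the part of speech of a term affects its feature weight. [Methods] Comparative experiments are run in two groups: one uses the proposed feature weighting method that incorporates part of speech, and the other uses the traditional TF-IDF method. In the proposed method, particle swarm optimization is used to compute the optimal part-of-speech weights iteratively. Both groups use an SVM classifier. [Results] The improved weighting method classifies better than traditional TF-IDF. Classification accuracy improves clearly at every feature dimension tested, by 2 to 6 percentage points. [Limitations] Owing to limited experimental conditions, the weight ratio found by particle swarm optimization only approximates the optimum; a larger dataset and more iterations are needed to obtain a better ratio. [Conclusions] Introducing part of speech into text classification effectively improves classification accuracy. Ranked from highest to lowest, the part-of-speech weights are nouns, strings, and verbs. The part-of-speech-based weighting method is not limited to one specific corpus and also applies to general corpora.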
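
As a rough illustration of the idea in the abstract, the sketch below scales each term's classical TF-IDF value by a weight looked up from its part of speech. The multiplicative form, the function name `pos_weighted_tfidf`, the tag names, and the numeric weights are all assumptions made for illustration; the abstract gives neither the paper's exact formula nor the weights that were found.

```python
import math
from collections import Counter

# Hypothetical part-of-speech weights. The abstract does not report the
# values found in the study; these numbers are placeholders.
POS_WEIGHTS = {"noun": 1.0, "string": 0.8, "verb": 0.6}
DEFAULT_WEIGHT = 0.5  # assumed weight for any other part of speech


def pos_weighted_tfidf(docs, pos_weights=POS_WEIGHTS, default=DEFAULT_WEIGHT):
    """Compute TF-IDF scaled by a per-POS weight (assumed multiplicative form).

    docs: list of documents, each a list of (term, pos) pairs.
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for doc in docs:
        df.update({term for term, _ in doc})

    vectors = []
    for doc in docs:
        tf = Counter(term for term, _ in doc)
        pos_of = {term: pos for term, pos in doc}  # last tag seen wins
        vec = {}
        for term, count in tf.items():
            idf = math.log(n_docs / df[term])
            vec[term] = count * idf * pos_weights.get(pos_of[term], default)
        vectors.append(vec)
    return vectors


# Tiny usage example with made-up, pre-tagged documents.
docs = [
    [("bank", "noun"), ("run", "verb")],
    [("bank", "noun"), ("loan", "noun")],
]
vectors = pos_weighted_tfidf(docs)
```

With all part-of-speech weights set to 1, this reduces to plain TF-IDF, which makes it a convenient baseline for the kind of two-group comparison the abstract describes.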

Keywords: text classification; part of speech; weight calculation; particle swarm optimization
Abstract

[Objective] To achieve higher classification accuracy, this paper improves the feature weighting method by introducing the effect of part of speech. [Methods] The effectiveness of feature weighting that incorporates part of speech is compared with that of classical TF-IDF in text classification. In the proposed approach, part-of-speech weights enter the feature weight calculation, and Particle Swarm Optimization is used to find the best part-of-speech weights. Both sets of experiments use an SVM classifier. [Results] The experimental results show that the improved feature weighting method performs better than the classical TF-IDF method. Classification accuracy improves clearly across different feature space dimensions, with gains of 2 to 6 percentage points. [Limitations] Because of limited experimental conditions, the weights determined in the experiment are only close to the best weights; the scale of the data and the number of iterations need to be increased to obtain better weights. [Conclusions] Introducing part of speech into text classification yields higher classification accuracy. The degree of influence of part of speech is nouns, verbs, and strings, in decreasing order. The modified feature weighting method is applicable not only to a particular corpus but also to general ones.
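
To make the weight search described in the abstract more concrete, here is a minimal sketch of particle swarm optimization over a three-dimensional weight vector. It uses a common inertia-weight variant of PSO; the function names, swarm size, iteration count, and coefficients are assumptions, not values from the paper. In the study the fitness would be the SVM classification accuracy obtained with a candidate set of part-of-speech weights; here a toy function with a made-up optimum stands in so the example runs on its own.

```python
import random


def pso_maximize(fitness, dim, n_particles=20, n_iters=50,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO on the box [0, 1]^dim.

    Returns (best_position, best_fitness). All parameter defaults are
    illustrative assumptions.
    """
    rng = random.Random(seed)
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # personal best positions
    pbest_fit = [fitness(p) for p in pos]       # personal best fitness values
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move the particle and clamp it to the search box.
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


# Stand-in fitness, NOT from the paper: it peaks at a made-up weight vector.
TARGET = [0.9, 0.6, 0.3]  # hypothetical part-of-speech weights


def toy_fitness(weights):
    return -sum((w - t) ** 2 for w, t in zip(weights, TARGET))


best, score = pso_maximize(toy_fitness, dim=3)
```

Because each fitness evaluation in the real setting means training and testing a classifier, the swarm size and iteration count directly bound the cost of the search, which is consistent with the limitation the abstract notes about needing more iterations.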

Keywords: Text classification; Part of speech; Feature weighting; Particle swarm optimization
Received: 2014-09-29      Published: 2015-05-21
CLC number: TP391
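Funding:

This work is one of the research outcomes of the National Natural Science Foundation of China project "Multidisciplinary Collaborative Modeling Theory and Experimental Research for Text Classification" (Grant No. 71373291).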

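Corresponding author: Lu Yonghe, ORCID: 0000-0002-7758-9365, E-mail: zsuluyonghe@163.com
Author contributions: Lu Yonghe proposed the research idea and revised the paper. Wang Hongbin collected and analyzed the data, designed the research plan, conducted the experiments, wrote the paper, and revised the final version.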
Cite this article:
Lu Yonghe, Wang Hongbin. Feature Weighting Method Affected by Part of Speech in Text Classification. New Technology of Library and Information Service, 2015, 31(4): 18-25.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2015.04.03      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2015/V31/I4/18

