Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (1): 91-101     https://doi.org/10.11925/infotech.2096-3467.2017.01.11
Application Paper
Optimizing Feature Selection Method for Text Classification with Shuffled Frog Leaping Algorithm*
Lu Yonghe, Chen Jinghuang
School of Information Management, Sun Yat-Sen University, Guangzhou 510006, China
Abstract

[Objective] Text data contain many redundant terms that are irrelevant to classification. This paper introduces the shuffled frog leaping algorithm (SFLA) to optimize feature selection and improve classification accuracy. [Methods] We first used CHI and IG to pre-select feature sets of different dimensions, then applied a modified SFLA to refine each pre-selected set. Each frog's position represents one feature selection rule, and classification accuracy serves as the fitness function. SVM and KNN classifiers were used to compute classification accuracy in the experiments. [Results] The modified SFLA achieved better classification results than CHI or IG alone, with a maximum improvement of 12%. [Limitations] Overfitting occurred at a small number of feature dimensions. [Conclusions] Combining feature pre-selection with the modified SFLA effectively excludes some noisy feature terms and thereby improves text classification accuracy.

Key words: Feature Selection; Text Classification; Shuffled Frog Leaping Algorithm
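Received: 2016-09-30      Published: 2017-02-22
CLC Number:  TP391
Funding: *This work is one of the research outcomes of the National Natural Science Foundation of China project "Theoretical and Experimental Research on Multidisciplinary Collaborative Modeling for Text Classification" (Grant No. 71373291) and the Guangdong Provincial Science and Technology Program project "Methods and Techniques for Constructing Topic-Oriented Chinese Corpora" (Grant No. 2015A030401037).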
Cite this article:
Lu Yonghe, Chen Jinghuang. Optimizing Feature Selection Method for Text Classification with Shuffled Frog Leaping Algorithm[J]. Data Analysis and Knowledge Discovery, 2017, 1(1): 91-101.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2017.01.11      or      http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2017/V1/I1/91
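  Figure: Flowchart of the improved individual evolution scheme in SFLA
  Figure: Flowchart of the feature selection optimization method based on the improved SFLA
  Figure: Flowchart of the experiment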
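
The abstract describes each frog's position as one feature selection rule, scored by classification accuracy. The following is a minimal sketch of that idea, not the authors' modified SFLA: it assumes a binary mask encoding, a simple bit-copying leap, a 1-NN classifier as the fitness function, and synthetic toy data. All names and parameter values (`sfla_select`, `leap`, `make_toy`, the population sizes) are hypothetical.

```python
import random

def accuracy(mask, train, valid):
    """1-NN accuracy using only the features whose mask bit is 1."""
    idx = [i for i, bit in enumerate(mask) if bit]
    if not idx:
        return 0.0
    def dist(a, b):
        return sum((a[i] - b[i]) ** 2 for i in idx)
    hits = 0
    for x, y in valid:
        _, pred = min(((dist(x, tx), ty) for tx, ty in train), key=lambda p: p[0])
        hits += pred == y
    return hits / len(valid)

def leap(worst, target, rng):
    """Assumed binary leap: copy each differing bit from target with probability 0.5."""
    return [t if (w != t and rng.random() < 0.5) else w for w, t in zip(worst, target)]

def sfla_select(n_features, fitness, n_frogs=12, n_memeplexes=3,
                local_steps=4, shuffles=10, seed=0):
    rng = random.Random(seed)
    pop = []
    for _ in range(n_frogs):
        mask = [rng.randint(0, 1) for _ in range(n_features)]
        pop.append((fitness(mask), mask))
    for _ in range(shuffles):
        pop.sort(key=lambda p: p[0], reverse=True)       # best frog first
        global_best = pop[0]
        # Deal frogs into memeplexes: frog i goes to memeplex i mod m.
        memeplexes = [pop[k::n_memeplexes] for k in range(n_memeplexes)]
        for plex in memeplexes:
            for _ in range(local_steps):
                plex.sort(key=lambda p: p[0], reverse=True)
                worst_score, worst = plex[-1]
                # Try the memeplex best first, then the global best.
                for target in (plex[0][1], global_best[1]):
                    cand = leap(worst, target, rng)
                    cand_score = fitness(cand)
                    if cand_score > worst_score:
                        break
                else:
                    # No improvement: replace the worst frog with a random one.
                    cand = [rng.randint(0, 1) for _ in range(n_features)]
                    cand_score = fitness(cand)
                plex[-1] = (cand_score, cand)
        pop = [frog for plex in memeplexes for frog in plex]   # shuffle step
    return max(pop, key=lambda p: p[0])

def make_toy(n, rng, n_features=8):
    """Synthetic data: features 0 and 1 carry the class signal, the rest are noise."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(0, 3) for _ in range(n_features)]
        x[0] = y + rng.gauss(0, 0.2)
        x[1] = -y + rng.gauss(0, 0.2)
        data.append((x, y))
    return data

rng = random.Random(1)
train, valid = make_toy(30, rng), make_toy(20, rng)
fitness = lambda mask: accuracy(mask, train, valid)
best_score, best_mask = sfla_select(8, fitness)
print(best_score, best_mask)
```

Because the fitness is measured on the same held-out set that guides the search, the selected mask can overfit that set, which is consistent with the limitation the abstract reports.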
Category      acq   crude  earn  grain  interest  money-fx  ship  trade  Total
Training set  1596  253    2840  41     190       206       108   251    5485
Test set      696   121    1083  10     81        87        36    75     2189
  Table: Category distribution of the Reuters-21578 corpus
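
The CHI pre-selection step named in the abstract scores each term against each category from document counts. Below is a hedged sketch of the standard chi-square statistic for a term/category pair, with terms ranked by their maximum score over categories. The ranking rule, the function names, and the four toy documents are assumptions for illustration; the page does not state how the authors aggregated per-category scores.

```python
from collections import Counter, defaultdict

def chi_square(a, b, c, d):
    """a: docs in category with term, b: docs outside category with term,
    c: docs in category without term, d: docs outside category without term."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def chi_preselect(docs, labels, k):
    """Return the k terms with the highest max-over-categories CHI score."""
    n_docs = len(docs)
    cat_size = Counter(labels)
    df_total = Counter()              # document frequency of each term
    df_cat = defaultdict(Counter)     # document frequency per category
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df_total[term] += 1
            df_cat[label][term] += 1
    scores = {}
    for term, df in df_total.items():
        best = 0.0
        for cat, size in cat_size.items():
            a = df_cat[cat][term]
            b = df - a
            c = size - a
            d = n_docs - size - b
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Made-up documents; only the category names come from the table above.
docs = [["oil", "barrel", "price"], ["oil", "tanker", "export"],
        ["merger", "shares", "price"], ["acquire", "shares", "stake"]]
labels = ["crude", "crude", "acq", "acq"]
print(chi_preselect(docs, labels, 2))  # → ['oil', 'shares']
```

A term such as "price", which appears equally often in both toy categories, scores zero and is dropped, which is the kind of redundant term the pre-selection is meant to filter.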
Dimension  CHI (%)  CHI_SFLA (%)  IG (%)  IG_SFLA (%)
100        93.102   92.143        90.132  90.772
200        93.878   92.965        91.366  92.873
300        92.554   92.005        89.082  92.736
400        91.000   94.381        86.249  92.873
500        90.726   94.153        85.381  92.325
600        87.848   92.599        84.651  92.645
700        85.975   93.878        83.919  92.462
800        85.244   93.970        83.645  92.234
900        84.513   93.878        83.326  91.594
1000       84.011   93.559        82.914  91.640
1100       83.646   94.107        82.686  93.376
1200       83.189   94.290        82.412  92.828
  Table: Classification accuracy of each feature selection method on Reuters-21578 with the SVM classifier
  Figure: Classification accuracy on the Reuters-21578 English corpus with the SVM classifier (CHI)
  Figure: Classification accuracy on the Reuters-21578 English corpus with the SVM classifier (IG)
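
As a worked check on the SVM/Reuters-21578 table above, the short script below computes the per-dimension change in accuracy from CHI to CHI_SFLA, in percentage points. The numbers are copied from the table; the variable names are illustrative only.

```python
dims = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200]
chi = [93.102, 93.878, 92.554, 91.000, 90.726, 87.848,
       85.975, 85.244, 84.513, 84.011, 83.646, 83.189]
chi_sfla = [92.143, 92.965, 92.005, 94.381, 94.153, 92.599,
            93.878, 93.970, 93.878, 93.559, 94.107, 94.290]

# Gain in percentage points at each feature dimension.
gains = {d: round(new - old, 3) for d, old, new in zip(dims, chi, chi_sfla)}
best_dim = max(gains, key=gains.get)
losses = [d for d, g in gains.items() if g < 0]

print(best_dim, gains[best_dim])  # → 1200 11.101
print(losses)                     # → [100, 200, 300]
```

For this table the gain grows with dimension, while at 100 to 300 dimensions CHI alone is slightly ahead of CHI_SFLA.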
Dimension  CHI (%)  CHI_SFLA (%)  IG (%)  IG_SFLA (%)
100        90.361   91.914        87.391  90.955
200        88.305   90.909        89.356  90.452
300        87.483   91.275        89.082  90.361
400        86.752   89.630        89.676  89.676
500        87.300   91.366        88.305  88.716
600        87.163   91.594        87.483  89.402
700        86.661   91.138        87.117  89.630
800        85.564   88.671        86.341  89.950
900        84.742   88.031        86.067  89.676
1000       83.920   88.077        85.062  89.127
1100       81.361   87.803        84.376  89.493
1200       81.635   87.163        83.919  89.721
  Table: Classification accuracy of each feature selection method on Reuters-21578 with the KNN classifier
  Figure: Classification accuracy on the Reuters-21578 English corpus with the KNN classifier (CHI)
  Figure: Classification accuracy on the Reuters-21578 English corpus with the KNN classifier (IG)
Dimension  CHI (%)  CHI_SFLA (%)  IG (%)  IG_SFLA (%)
100        77.042   77.417        55.667  56.958
200        83.292   85.792        68.667  76.333
300        80.833   86.083        73.833  83.083
400        77.458   84.625        77.083  79.000
500        78.875   85.708        78.708  80.292
600        80.583   86.167        80.083  83.458
700        80.417   86.208        81.167  84.625
800        80.375   85.333        81.833  86.250
900        80.667   85.958        81.417  84.708
1000       80.750   87.292        81.167  86.667
1100       80.583   84.667        80.500  82.125
1200       80.208   86.042        80.250  83.250
  Table: Classification accuracy of each feature selection method on the laboratory corpus with the SVM classifier
  Figure: Classification accuracy of CHI_SFLA and CHI on the laboratory corpus with the SVM classifier
  Figure: Classification accuracy of IG_SFLA and IG on the laboratory corpus with the SVM classifier
Dimension  CHI (%)  CHI_SFLA (%)  IG (%)  IG_SFLA (%)
100        72.125   72.750        52.958  55.583
200        66.750   78.583        65.875  75.125
300        69.250   77.083        65.458  72.917
400        68.458   76.333        67.667  71.917
500        69.083   79.000        67.167  70.917
600        68.167   76.708        65.917  72.292
700        68.083   75.500        64.542  69.917
800        68.750   77.292        60.458  70.458
900        68.167   76.167        57.208  68.833
1000       70.625   74.708        57.167  69.917
1100       71.417   77.208        58.667  71.458
1200       69.958   78.792        60.792  68.750
  Table: Classification accuracy of each feature selection method on the laboratory corpus with the KNN classifier
  Figure: Classification accuracy of CHI_SFLA and CHI on the laboratory corpus with the KNN classifier
  Figure: Classification accuracy of IG_SFLA and IG on the laboratory corpus with the KNN classifier
Pair 1: P_old - P_new (paired differences)
Mean      Std. deviation  Std. error of mean  95% CI lower  95% CI upper  t        df  Sig. (2-tailed)
-5.39820  3.29716         0.33651             -6.06626      -4.73013      -16.042  95  0.000
  Table: Paired-samples t-test results
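
The t statistic in the paired-samples table can be recomputed from its own summary values, since for paired differences t = mean / (sd / sqrt(n)) with n = df + 1. The sketch below does this arithmetic with numbers copied from the table. Note that n = 96 would match 4 accuracy tables × 12 dimensions × 2 pre-selection methods, although the page does not state how the pairs were formed, so that reading is an assumption.

```python
import math

mean_diff = -5.39820   # mean of P_old - P_new, from the table
sd_diff = 3.29716      # standard deviation of the differences
df = 95
n = df + 1             # number of paired observations

se = sd_diff / math.sqrt(n)   # standard error of the mean difference
t = mean_diff / se            # paired-samples t statistic

print(f"n = {n}, SE = {se:.5f}, t = {t:.3f}")
```

The recomputed values agree with the reported standard error and t statistic to within the rounding of the published inputs.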