Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (1): 91-101    DOI: 10.11925/infotech.2096-3467.2017.01.11
Original Article
Optimizing Feature Selection Method for Text Classification with Shuffled Frog Leaping Algorithm
Lu Yonghe(), Chen Jinghuang
School of Information Management, Sun Yat-Sen University, Guangzhou 510006, China
Abstract  

[Objective] This paper introduces the shuffled frog leaping algorithm (SFLA) to remove irrelevant terms from texts and to optimize the feature selection method, improving the accuracy of text classification. [Methods] First, we used CHI and IG to pre-select feature terms at different dimensionalities, and then adopted the modified SFLA to refine the list of text features. Second, we represented each feature selection rule as a frog and used classification precision as the fitness function. Finally, SVM and KNN classifiers were adopted to calculate the classification precision. [Results] The modified SFLA achieved higher classification precision than CHI and IG, with a maximum improvement of 12%. [Limitations] Feature overfitting occurred for a small portion of the feature dimensionalities. [Conclusions] Combining feature pre-selection with the modified SFLA effectively excludes irrelevant or invalid terms and thus improves the precision of feature selection.
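The scheme described in [Methods], where each frog encodes a feature selection rule and classification precision is the fitness function, can be illustrated with a minimal sketch. This is not the authors' implementation: the binary frog encoding, the discrete leap operator, the population and iteration sizes, and the use of a held-out validation split are illustrative assumptions, and a KNN classifier could be substituted for the SVM in the fitness function.

# Minimal sketch (assumptions noted above), Python with NumPy and scikit-learn.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_score

def fitness(mask, X_tr, y_tr, X_val, y_val):
    """Fitness of one frog: macro-averaged precision of an SVM trained on the selected terms."""
    idx = np.flatnonzero(mask)
    if idx.size == 0:
        return 0.0
    clf = LinearSVC().fit(X_tr[:, idx], y_tr)
    return precision_score(y_val, clf.predict(X_val[:, idx]),
                           average="macro", zero_division=0)

def sfla_select(X_tr, y_tr, X_val, y_val, pop_size=30, n_memeplexes=5,
                n_shuffles=20, n_local=5, seed=0):
    rng = np.random.default_rng(seed)
    n_feat = X_tr.shape[1]
    frogs = rng.integers(0, 2, size=(pop_size, n_feat))      # random binary feature masks
    fit = np.array([fitness(f, X_tr, y_tr, X_val, y_val) for f in frogs])

    for _ in range(n_shuffles):
        order = np.argsort(-fit)                              # shuffle: rank all frogs by fitness
        frogs, fit = frogs[order], fit[order]
        global_best = frogs[0].copy()
        for m in range(n_memeplexes):                         # deal frogs into memeplexes
            members = np.arange(m, pop_size, n_memeplexes)
            for _ in range(n_local):
                worst = members[np.argmin(fit[members])]
                local_best = members[np.argmax(fit[members])]
                # discrete "leap": copy a random subset of bits from the local best frog
                copy_bits = rng.random(n_feat) < 0.5
                cand = np.where(copy_bits, frogs[local_best], frogs[worst])
                cand_fit = fitness(cand, X_tr, y_tr, X_val, y_val)
                if cand_fit <= fit[worst]:                    # no gain: leap toward the global best
                    cand = np.where(copy_bits, global_best, frogs[worst])
                    cand_fit = fitness(cand, X_tr, y_tr, X_val, y_val)
                if cand_fit <= fit[worst]:                    # still no gain: replace with a random frog
                    cand = rng.integers(0, 2, size=n_feat)
                    cand_fit = fitness(cand, X_tr, y_tr, X_val, y_val)
                frogs[worst], fit[worst] = cand, cand_fit
    best = np.argmax(fit)
    return frogs[best], fit[best]                             # best feature mask and its precision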

Key words: Feature Selection; Text Classification; Shuffled Frog Leaping Algorithm
Received: 30 September 2016      Published: 22 February 2017
CLC number: TP391

Cite this article:

Lu Yonghe, Chen Jinghuang. Optimizing Feature Selection Method for Text Classification with Shuffled Frog Leaping Algorithm. Data Analysis and Knowledge Discovery, 2017, 1(1): 91-101.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.01.11     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2017/V1/I1/91

Category        acq     crude   earn    grain   interest  money-fx  ship   trade   Total
Training set    1 596   253     2 840   41      190       206       108    251     5 485
Test set        696     121     1 083   10      81        87        36     75      2 189
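The result tables below compare CHI and IG pre-selection with their SFLA-refined counterparts (CHI_SFLA, IG_SFLA) at dimensionalities from 100 to 1 200. A minimal sketch of such a pre-selection step is given here; it assumes scikit-learn's chi2 scorer and uses mutual_info_classif as a stand-in for information gain, so names and parameters are illustrative rather than the authors' implementation.

# Minimal sketch of CHI / IG-style term pre-selection (assumptions noted above).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

def preselect_terms(docs, labels, k=1200, method="chi"):
    """Keep the k highest-scoring terms; docs are assumed to be pre-segmented, space-joined texts."""
    vec = CountVectorizer()
    X = vec.fit_transform(docs)
    scorer = chi2 if method == "chi" else mutual_info_classif   # "chi" -> CHI, else IG-like scorer
    selector = SelectKBest(scorer, k=k).fit(X, labels)
    kept_terms = vec.get_feature_names_out()[selector.get_support()]
    return selector.transform(X), kept_terms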
Dimensions  CHI (%)   CHI_SFLA (%)   IG (%)    IG_SFLA (%)
100         93.102    92.143         90.132    90.772
200         93.878    92.965         91.366    92.873
300         92.554    92.005         89.082    92.736
400         91.000    94.381         86.249    92.873
500         90.726    94.153         85.381    92.325
600         87.848    92.599         84.651    92.645
700         85.975    93.878         83.919    92.462
800         85.244    93.970         83.645    92.234
900         84.513    93.878         83.326    91.594
1 000       84.011    93.559         82.914    91.640
1 100       83.646    94.107         82.686    93.376
1 200       83.189    94.290         82.412    92.828

Dimensions  CHI (%)   CHI_SFLA (%)   IG (%)    IG_SFLA (%)
100         90.361    91.914         87.391    90.955
200         88.305    90.909         89.356    90.452
300         87.483    91.275         89.082    90.361
400         86.752    89.630         89.676    89.676
500         87.300    91.366         88.305    88.716
600         87.163    91.594         87.483    89.402
700         86.661    91.138         87.117    89.630
800         85.564    88.671         86.341    89.950
900         84.742    88.031         86.067    89.676
1 000       83.920    88.077         85.062    89.127
1 100       81.361    87.803         84.376    89.493
1 200       81.635    87.163         83.919    89.721

Dimensions  CHI (%)   CHI_SFLA (%)   IG (%)    IG_SFLA (%)
100         77.042    77.417         55.667    56.958
200         83.292    85.792         68.667    76.333
300         80.833    86.083         73.833    83.083
400         77.458    84.625         77.083    79.000
500         78.875    85.708         78.708    80.292
600         80.583    86.167         80.083    83.458
700         80.417    86.208         81.167    84.625
800         80.375    85.333         81.833    86.250
900         80.667    85.958         81.417    84.708
1 000       80.750    87.292         81.167    86.667
1 100       80.583    84.667         80.500    82.125
1 200       80.208    86.042         80.250    83.250

Dimensions  CHI (%)   CHI_SFLA (%)   IG (%)    IG_SFLA (%)
100         72.125    72.750         52.958    55.583
200         66.750    78.583         65.875    75.125
300         69.250    77.083         65.458    72.917
400         68.458    76.333         67.667    71.917
500         69.083    79.000         67.167    70.917
600         68.167    76.708         65.917    72.292
700         68.083    75.500         64.542    69.917
800         68.750    77.292         60.458    70.458
900         68.167    76.167         57.208    68.833
1 000       70.625    74.708         57.167    69.917
1 100       71.417    77.208         58.667    71.458
1 200       69.958    78.792         60.792    68.750

Paired-samples t-test (paired differences, P_old - P_new):

Pair                    Mean       Std. dev.  Std. error of mean  95% CI lower  95% CI upper  t        df  Sig. (two-tailed)
Pair 1  P_old - P_new   -5.39820   3.29716    0.33651             -6.06626      -4.73013      -16.042  95  .000
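The paired test above compares the baseline precision (P_old: CHI or IG) with the SFLA-refined precision (P_new: CHI_SFLA or IG_SFLA); with 95 degrees of freedom it presumably pairs all 96 precision values (4 result tables, 12 dimensionalities, 2 baseline methods). The sketch below, which is not the authors' script and assumes SciPy, shows how such a test is computed; only the first four CHI vs. CHI_SFLA rows of the first table are used for illustration, so its output will not equal the reported t = -16.042.

# Minimal paired-samples t-test sketch (illustrative data only).
import numpy as np
from scipy import stats

p_old = np.array([93.102, 93.878, 92.554, 91.000])   # CHI precision (%), first four rows
p_new = np.array([92.143, 92.965, 92.005, 94.381])   # corresponding CHI_SFLA precision (%)

t_stat, p_value = stats.ttest_rel(p_old, p_new)      # two-tailed paired t-test, df = n - 1
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")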