Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (1): 91-101    DOI: 10.11925/infotech.2096-3467.2017.01.11
Original Article
Optimizing Feature Selection Method for Text Classification with Shuffled Frog Leaping Algorithm
Yonghe Lu, Jinghuang Chen
School of Information Management, Sun Yat-Sen University, Guangzhou 510006, China
Abstract  

[Objective] This paper introduces the shuffled frog leaping algorithm (SFLA) to remove irrelevant terms from texts and optimizes the feature selection method to improve the accuracy of text classification. [Methods] First, we used the CHI and IG techniques to pre-select feature terms at different dimensionalities, and then applied a modified SFLA to refine the resulting feature list. Second, each frog represented one feature selection rule, and the classification precision served as the fitness function. Finally, SVM and KNN classifiers were used to compute the classification precision. [Results] The modified SFLA achieved higher classification precision than CHI and IG alone, with a maximum improvement of 12%. [Limitations] Feature overfitting occurred in a small portion of the feature-space dimensions. [Conclusions] Combining feature pre-selection with the modified SFLA can effectively exclude irrelevant or invalid terms and thereby improve the precision of feature selection.

Key words: Feature Selection; Text Classification; Shuffled Frog Leaping Algorithm
Received: 30 September 2016      Published: 22 February 2017
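
The two-stage procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the SFLA was modified, so the binary encoding (one bit per pre-selected term), the `leap` rule, the parameter defaults, and the names `chi_square`, `sfla_select`, and `toy_fitness` are all assumptions made for illustration. `chi_square` is the standard CHI statistic for one term and one class. In the paper the fitness is the classification precision of an SVM or KNN classifier; here a toy function stands in for it so the example runs without a corpus.

```python
import random

def chi_square(a, b, c, d):
    """Standard CHI statistic of a term for one class (2x2 document counts).

    a: in-class docs containing the term, b: out-of-class docs containing it,
    c: in-class docs without the term,    d: out-of-class docs without it.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def leap(worst, target, rng):
    # Assumed discrete leap: take each bit from the target frog with
    # probability 0.5, otherwise keep the worst frog's own bit.
    return [t if rng.random() < 0.5 else w for w, t in zip(worst, target)]

def sfla_select(n_terms, fitness, n_frogs=20, n_memeplexes=4,
                local_steps=5, n_shuffles=30, seed=0):
    """Search for a 0/1 mask over n_terms pre-selected terms that maximizes fitness."""
    rng = random.Random(seed)
    frogs = [[rng.randint(0, 1) for _ in range(n_terms)] for _ in range(n_frogs)]
    for _ in range(n_shuffles):
        # Shuffle step: rank all frogs, then deal them into memeplexes.
        frogs.sort(key=fitness, reverse=True)
        global_best = frogs[0]
        memeplexes = [frogs[i::n_memeplexes] for i in range(n_memeplexes)]
        for mem in memeplexes:
            for _ in range(local_steps):
                mem.sort(key=fitness, reverse=True)
                worst = mem[-1]
                # Try the memeplex best first, then the global best.
                for target in (mem[0], global_best):
                    candidate = leap(worst, target, rng)
                    if fitness(candidate) > fitness(worst):
                        break
                else:
                    # No improvement from either leap: replace with a random frog.
                    candidate = [rng.randint(0, 1) for _ in range(n_terms)]
                mem[-1] = candidate
        frogs = [frog for mem in memeplexes for frog in mem]
    return max(frogs, key=fitness)

# Toy stand-in for classification precision (hypothetical): reward the
# "relevant" term indices and penalize every other selected term.
RELEVANT = {0, 2, 5}

def toy_fitness(mask):
    hits = sum(bit for i, bit in enumerate(mask) if i in RELEVANT)
    noise = sum(bit for i, bit in enumerate(mask) if i not in RELEVANT)
    return hits - 0.5 * noise

best_mask = sfla_select(n_terms=8, fitness=toy_fitness)
print(best_mask, toy_fitness(best_mask))
```

In a real experiment, `toy_fitness` would be replaced by a function that trains and evaluates a classifier on the terms selected by the mask, which is far more expensive per call and is the reason for pre-selecting terms with CHI or IG before running the search.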

Cite this article:

Yonghe Lu, Jinghuang Chen. Optimizing Feature Selection Method for Text Classification with Shuffled Frog Leaping Algorithm. Data Analysis and Knowledge Discovery, 2017, 1(1): 91-101.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.01.11     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2017/V1/I1/91
