New Technology of Library and Information Service  2014, Vol. 30 Issue (4): 48-57    DOI: 10.11925/infotech.1003-3513.2014.04.08
Improvement of Text Feature Extraction with Genetic Algorithm
Lu Yonghe, Liang Minghui
School of Information Management, Sun Yat-Sen University, Guangzhou 510006, China

[Objective] To analyze a range of feature extraction methods and improve the traditional feature extraction process. [Methods] The paper first uses a feature pool to pre-extract features, then selects the best feature set with a genetic algorithm that uses group coding. [Results] When the fitness function is based on the KNN classification algorithm, the proposed method performs best, and the improvement is more pronounced at lower feature dimensions. The proposed method also gives more stable text classification results across different feature dimensions and corpora. [Limitations] The corpus is not large enough. Only information gain (IG) and the chi-square statistic (CHI) are used to extract features when constructing the feature pool. Group coding ignores semantic relationships among words. The population size and the number of iterations in the genetic algorithm are restricted by the experimental conditions. [Conclusions] Adding a feature pool to pre-extract features improves the stability of text classification. Adding a genetic algorithm to text feature extraction makes classification more accurate. Using group coding in the genetic algorithm reduces feature overfitting and improves efficiency.
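
The pipeline summarized in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the group coding scheme, the KNN settings, or the genetic operators, so the code below assumes fixed-size term groups with one bit per group, leave-one-out 1-NN accuracy as the fitness, and tournament selection with one-point crossover. The toy corpus and every function name are hypothetical.

```python
import math
import random
from collections import Counter

# Toy corpus (hypothetical data): (set of terms, class label).
DOCS = [
    ({"goal", "match", "team", "win"}, "sport"),
    ({"team", "coach", "match", "season"}, "sport"),
    ({"goal", "season", "player", "win"}, "sport"),
    ({"stock", "market", "price", "win"}, "finance"),
    ({"bank", "market", "rate", "season"}, "finance"),
    ({"stock", "bank", "price", "rate"}, "finance"),
]
LABELS = sorted({label for _, label in DOCS})
VOCAB = sorted(set().union(*(terms for terms, _ in DOCS)))

def chi_square(term):
    """Max over classes of the 2x2 chi-square statistic for term vs. class."""
    n, best = len(DOCS), 0.0
    for c in LABELS:
        a = sum(1 for t, l in DOCS if term in t and l == c)       # term, class
        b = sum(1 for t, l in DOCS if term in t and l != c)       # term, other
        e = sum(1 for t, l in DOCS if term not in t and l == c)   # no term, class
        d = n - a - b - e                                         # no term, other
        denom = (a + e) * (b + d) * (a + b) * (e + d)
        if denom:
            best = max(best, n * (a * d - e * b) ** 2 / denom)
    return best

def entropy(labels):
    if not labels:
        return 0.0
    total = len(labels)
    return -sum(k / total * math.log2(k / total) for k in Counter(labels).values())

def info_gain(term):
    """Class entropy minus the conditional entropy given term presence."""
    with_t = [l for t, l in DOCS if term in t]
    without = [l for t, l in DOCS if term not in t]
    n = len(DOCS)
    cond = len(with_t) / n * entropy(with_t) + len(without) / n * entropy(without)
    return entropy([l for _, l in DOCS]) - cond

def build_pool(k):
    """Feature pool: union of the top-k terms by CHI and the top-k by IG."""
    top_chi = sorted(VOCAB, key=chi_square, reverse=True)[:k]
    top_ig = sorted(VOCAB, key=info_gain, reverse=True)[:k]
    return sorted(set(top_chi) | set(top_ig))

def decode(chrom, groups):
    """Assumed group coding: one bit per group keeps or drops the whole group."""
    return {term for bit, g in zip(chrom, groups) if bit for term in g}

def fitness(chrom, groups):
    """Leave-one-out 1-NN accuracy; similarity = number of shared selected terms."""
    feats = decode(chrom, groups)
    if not feats:
        return 0.0
    hits = 0
    for i, (terms, label) in enumerate(DOCS):
        others = [(len(terms & t & feats), l) for j, (t, l) in enumerate(DOCS) if j != i]
        _, predicted = max(others, key=lambda p: p[0])
        hits += predicted == label
    return hits / len(DOCS)

def evolve(groups, pop_size=8, generations=10, p_mut=0.1, seed=0):
    """Small GA: elitism, binary tournament, one-point crossover, bit-flip mutation."""
    rng = random.Random(seed)
    n = len(groups)
    score = lambda c: fitness(c, groups)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        nxt = [max(pop, key=score)[:]]  # carry the best chromosome over unchanged
        while len(nxt) < pop_size:
            p1 = max(rng.sample(pop, 2), key=score)
            p2 = max(rng.sample(pop, 2), key=score)
            cut = rng.randint(1, n - 1) if n > 1 else 0
            child = p1[:cut] + p2[cut:]
            nxt.append([b ^ 1 if rng.random() < p_mut else b for b in child])
        pop = nxt
    return max(pop, key=score)

pool = build_pool(k=4)
groups = [pool[i:i + 2] for i in range(0, len(pool), 2)]  # group size 2 is arbitrary
best = evolve(groups)
selected = decode(best, groups)
print(sorted(selected), fitness(best, groups))
```

Coding whole groups rather than single terms shortens the chromosome, which is one plausible reading of how group coding could improve efficiency. The grouping here is purely positional, which matches the limitation stated in the abstract that semantic relationships among words are ignored.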

Key words: Text categorization; Feature extraction; Genetic algorithms; Feature pool
Received: 25 December 2013      Published: 19 May 2014
CLC Number: G254

Cite this article:

Lu Yonghe, Liang Minghui. Improvement of Text Feature Extraction with Genetic Algorithm. New Technology of Library and Information Service, 2014, 30(4): 48-57.


