New Technology of Library and Information Service  2004, Vol. 20 Issue (12): 7-9    DOI: 10.11925/infotech.1003-3513.2004.12.02
Application of Improved Information Gain Feature Selection Method to Text Clustering
Chen Tao1   Song Yan2   Xie Yangqun1
1(Department of Management Science and Engineering, Ningbo, Zhejiang 315211, China)
2(Department of Business Administration, Nanjing, Jiangsu 210093, China)
Abstract  

This paper applies an improved information gain method to text clustering. A sample of 250 texts is retrieved from the corpus, and text feature vectors are constructed using the Vector Space Model and the information gain feature selection method. The texts are then clustered automatically with C-means, yielding precision, recall, and F-measure of 0.82, 0.88, and 0.83, respectively.
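
As a rough illustration of the feature selection step, the sketch below ranks terms by the standard (unimproved) information gain on a toy corpus. The abstract does not describe the paper's improved variant, so this is not the authors' method. The toy documents, the labels, and the helper names `entropy`, `information_gain`, and `select_features` are all assumptions made for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(docs, labels, term):
    """Standard information gain of a term's presence or absence.

    docs is a list of term sets; labels holds one class label per document.
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def select_features(docs, labels, k):
    """Return the k terms with the highest information gain."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab,
                    key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return ranked[:k]


# Hypothetical toy corpus: each document is reduced to its set of terms.
docs = [
    {"stock", "market", "price"},
    {"stock", "bank", "price"},
    {"match", "goal", "price"},
    {"match", "team", "goal"},
]
labels = ["finance", "finance", "sport", "sport"]

top_terms = select_features(docs, labels, 3)
print(top_terms)
```

The selected terms would then serve as the dimensions of the Vector Space Model vectors that are passed to the clustering step. Terms that separate the two toy classes perfectly ("stock", "match", "goal") score higher than a term such as "price" that appears in both.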

Key words: Information gain; Feature selection; Clustering
Received: 07 July 2004      Published: 25 December 2004
ZTFLH: TP181; G352
Corresponding Authors: Xie Yangqun     E-mail: xieyangqun1980@yahoo.com.cn
About authors: Chen Tao, Song Yan, Xie Yangqun

Cite this article:

Chen Tao, Song Yan, Xie Yangqun. Application of Improved Information Gain Feature Selection Method to Text Clustering. New Technology of Library and Information Service, 2004, 20(12): 7-9.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2004.12.02     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2004/V20/I12/7

