Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (4): 94-99    DOI: 10.11925/infotech.2096-3467.2017.04.11
Original Article
Application of Text Clustering Method Based on Improved CFSFDP Algorithm
Zhan Chunxia, Wang Rongbo, Huang Xiaoxi, Chen Zhiqun
School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China
Abstract  

[Objective] This paper aims to improve the unsatisfactory performance of the CFSFDP (clustering by fast search and find of density peaks) algorithm with the help of particle swarm optimization. [Methods] First, we determined the cluster centers by searching for the optimal local density and distance thresholds, which increased the accuracy of the results. The selected cluster centers have relatively high local density and distance, which reduces the influence of discrete points (outliers). Then, we evaluated the proposed method on a randomly selected dataset from the question-answer database of a college entrance exam consulting platform. [Results] The modified CFSFDP algorithm performed better than the original one. [Limitations] We did not use semantic relations when processing the texts. [Conclusions] The proposed algorithm achieves good clustering results and can improve the efficiency of the consulting personnel.

Key words: CFSFDP; Cluster Centers; Particle Swarm Optimization Algorithm
Received: 30 December 2016      Published: 24 May 2017
ZTFLH:  TP391  
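
The center-selection step summarized in the abstract can be sketched as follows. This is a minimal illustration of density-peak clustering under stated assumptions, not the authors' implementation: the function name `cfsfdp`, the cutoff `d_c`, and the thresholds `rho_min` and `delta_min` are hypothetical, Euclidean distance stands in for whatever text similarity the paper uses, and the thresholds are passed in by hand, whereas the paper searches for them with particle swarm optimization.

```python
import numpy as np

def cfsfdp(points, d_c, rho_min, delta_min):
    """Density-peak clustering sketch. Returns (labels, centers).

    rho_min and delta_min are hand-picked here (an assumption); the
    paper tunes such thresholds with particle swarm optimization.
    """
    n = len(points)
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Local density: how many other points lie closer than the cutoff d_c.
    rho = (dist < d_c).sum(axis=1) - 1
    # Visit points from densest to sparsest; ties are broken by index.
    order = np.argsort(-rho, kind="stable")
    delta = np.zeros(n)
    parent = np.full(n, -1)
    for rank, i in enumerate(order):
        if rank == 0:
            # The densest point has no denser neighbour; use its largest distance.
            delta[i] = dist[i].max()
            continue
        denser = order[:rank]
        j = denser[np.argmin(dist[i, denser])]
        delta[i] = dist[i, j]   # distance to the nearest denser point
        parent[i] = j
    # Centers need both a high local density and a high distance.
    centers = [int(i) for i in order
               if rho[i] >= rho_min and delta[i] >= delta_min]
    labels = np.full(n, -1)
    for k, c in enumerate(centers):
        labels[c] = k
    # Every other point inherits the label of its nearest denser neighbour.
    # Labels stay -1 if the densest point itself fails the thresholds.
    for i in order:
        if labels[i] == -1 and parent[i] != -1:
            labels[i] = labels[parent[i]]
    return labels, centers

if __name__ == "__main__":
    # Two small, well-separated groups of toy 2-D points.
    pts = np.array([[0, 0], [0, 0.1], [0.1, 0], [0.1, 0.1], [0.05, 0.05],
                    [5, 5], [5, 5.1], [5.1, 5], [5.1, 5.1], [5.05, 5.05]])
    labels, centers = cfsfdp(pts, d_c=0.12, rho_min=4, delta_min=1.0)
    print(labels, centers)
```

Requiring both thresholds is what keeps isolated points from becoming centers: an outlier can have a large distance to any denser point, but its local density stays low.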

Cite this article:

Zhan Chunxia, Wang Rongbo, Huang Xiaoxi, Chen Zhiqun. Application of Text Clustering Method Based on Improved CFSFDP Algorithm. Data Analysis and Knowledge Discovery, 2017, 1(4): 94-99.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.04.11     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2017/V1/I4/94


Dataset    Code   Military training   Bonus points   Score range   Telephone   Provincial control line   Application rejection
data1050   200    100                 200            100           150         200                       100
data3100   600    300                 600            300           500         500                       300
data5000   1000   400                 1000           400           900         900                       400
Algorithm            Dataset    Accuracy   Precision   Recall   F-Measure
Agglomerative        data1050   0.7305     0.7743      0.7969   0.7854
Agglomerative        data3100   0.7077     0.6976      0.7811   0.7370
Agglomerative        data5000   0.6808     0.6598      0.6627   0.6612
DBSCAN               data1050   0.6486     0.6795      0.7332   0.7052
DBSCAN               data3100   0.6797     0.6761      0.7880   0.7278
DBSCAN               data5000   0.6006     0.6270      0.6500   0.6643
CFSFDP               data1050   0.8171     0.8050      0.8090   0.8070
CFSFDP               data3100   0.750      0.7375      0.6617   0.6975
CFSFDP               data5000   0.7425     0.7438      0.6189   0.6756
Proposed algorithm   data1050   0.8333     0.7171      0.9098   0.8609
Proposed algorithm   data3100   0.7574     0.7421      0.7676   0.7546
Proposed algorithm   data5000   0.7712     0.7340      0.7450   0.7395
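
For readers who want to relate the last column to the two before it, the sketch below computes an F-measure from precision and recall. It assumes the common harmonic-mean (F1) form; this page does not state which definition the paper uses, and the helper name `f_measure` is hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 form, assumed here)."""
    if precision + recall == 0:
        return 0.0  # avoid dividing by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)

# Agglomerative on data1050, using the precision and recall listed above.
print(round(f_measure(0.7743, 0.7969), 4))  # prints 0.7854
```

Under this assumption the Agglomerative data1050 row is reproduced to four decimal places.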