Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (4): 94-99     https://doi.org/10.11925/infotech.2096-3467.2017.04.11
Application Paper
Application of Text Clustering Method Based on Improved CFSFDP Algorithm*
Zhan Chunxia, Wang Rongbo, Huang Xiaoxi, Chen Zhiqun
School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China
Abstract

[Objective] This paper improves the CFSFDP (Clustering by Fast Search and Find of Density Peaks) algorithm, which selects cluster centers by the product of local density and distance and can therefore produce unsatisfactory clustering results. [Methods] We propose a CFSFDP algorithm based on particle swarm optimization. Particle swarm optimization searches for the best local density and distance thresholds in CFSFDP, so that the selected cluster centers have both relatively high local density and relatively large distance, which reduces the influence of outliers on center selection. We tested the method on datasets randomly selected from the examinee question database of a college entrance examination consulting platform. [Results] Across the datasets, the proposed algorithm clearly outperformed the basic CFSFDP algorithm in accuracy, recall, and F-measure. [Limitations] Semantic relations were not considered when processing the texts. [Conclusions] The proposed method clusters well. Applied to the college entrance examination consulting database, it can reduce the workload of the consulting staff and help answer examinees' questions quickly.

Keywords: CFSFDP; Cluster Centers; Particle Swarm Optimization Algorithm
Received: 2016-12-30      Published: 2017-05-24
CLC Number: TP391
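Funding: *This work is one of the research outcomes of the National Natural Science Foundation of China Youth Fund project "Computational Model of Chinese Metaphor Incorporating Embodied Cognition Mechanisms and Its Implementation" (No. 61103101), the National Natural Science Foundation of China Youth Fund project "Research on Automatic Segmentation of Chinese Sentence Groups Based on Markov Trees and DRT" (No. 61202281), and the Ministry of Education Humanities and Social Sciences Youth Fund project "Research on Chinese Metaphor Computation for Information Processing" (No. 10YJCZH052).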
Cite this article:
Zhan Chunxia, Wang Rongbo, Huang Xiaoxi, Chen Zhiqun. Application of Text Clustering Method Based on Improved CFSFDP Algorithm[J]. Data Analysis and Knowledge Discovery, 2017, 1(4): 94-99.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2017.04.11      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2017/V1/I4/94
Figure: Scatter plot [4]
Figure: Decision graph [4]
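The decision graph of the density-peaks method [4] plots, for every point, its local density (rho) against its distance to the nearest point of higher density (delta). The sketch below computes both quantities and compares two ways of picking centers: ranking by the product rho * delta, and requiring both values to exceed a threshold, as the abstract describes. It is a minimal illustration only. The toy coordinates, the cut-off distance `dc`, the threshold values, and the function names `density_peaks` and `centers_by_threshold` are assumptions made for this example and do not come from the paper.

```python
import numpy as np


def density_peaks(dist, dc):
    """Return (rho, delta) for a symmetric distance matrix `dist`.

    rho uses the cut-off kernel: the number of other points closer than dc.
    delta is the distance to the nearest point that ranks higher in density
    (ties are broken by index order, an assumption of this sketch).
    """
    n = dist.shape[0]
    rho = (dist < dc).sum(axis=1) - 1          # subtract 1 to exclude the point itself
    order = np.argsort(-rho, kind="stable")    # densest first
    delta = np.empty(n)
    delta[order[0]] = dist[order[0]].max()     # densest point: largest distance by convention
    for k in range(1, n):
        i = order[k]
        delta[i] = dist[i, order[:k]].min()    # nearest point ranked before i
    return rho, delta


def centers_by_threshold(rho, delta, rho_min, delta_min):
    """Indices whose density AND distance both exceed their thresholds."""
    return np.flatnonzero((rho > rho_min) & (delta > delta_min))


# Toy data (hypothetical): two compact groups plus two isolated points.
points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0], [10.5, 10.5],
    [5.0, 20.0], [5.0, 20.5],
])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

rho, delta = density_peaks(dist, dc=1.2)

# Ranking by the product: asking for three centers pulls in an isolated point.
top3_by_product = np.argsort(-(rho * delta))[:3]

# Thresholding both quantities keeps only the two dense group centers.
centers = centers_by_threshold(rho, delta, rho_min=3, delta_min=5.0)
```

On this toy set the isolated point at index 10 has low density but a large delta, so its product outranks every ordinary group member, while the two-threshold rule rejects it. Real text data would use a document distance in place of the Euclidean distance used here.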
Figure: Algorithm flow
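The abstract states that particle swarm optimization searches for the density and distance thresholds. The sketch below shows a generic particle swarm loop over a two-dimensional box, where each particle is a candidate pair of thresholds. It is an illustration under stated assumptions, not the authors' implementation: this page does not give the fitness function, so `toy_objective` is a placeholder quadratic, and the name `pso_minimize`, the swarm size, the iteration count, and the coefficients `w`, `c1`, `c2` are all hypothetical choices.

```python
import numpy as np


def pso_minimize(objective, lower, upper, n_particles=20, n_iter=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` inside the box [lower, upper] with a basic PSO."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim = lower.size

    pos = rng.uniform(lower, upper, size=(n_particles, dim))
    vel = np.zeros_like(pos)
    pbest = pos.copy()                                   # best position seen by each particle
    pbest_val = np.array([objective(p) for p in pos])
    gbest = pbest[pbest_val.argmin()].copy()             # best position seen by the swarm
    gbest_val = pbest_val.min()

    for _ in range(n_iter):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Inertia + pull toward personal best + pull toward global best.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lower, upper)           # keep particles inside the box

        vals = np.array([objective(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved] = pos[improved]
        pbest_val[improved] = vals[improved]
        if pbest_val.min() < gbest_val:
            gbest = pbest[pbest_val.argmin()].copy()
            gbest_val = pbest_val.min()

    return gbest, gbest_val


def toy_objective(thresholds):
    """Placeholder fitness (assumption): smallest at rho_min = 3, delta_min = 5."""
    rho_min, delta_min = thresholds
    return (rho_min - 3.0) ** 2 + (delta_min - 5.0) ** 2


best, best_val = pso_minimize(toy_objective, lower=[0.0, 0.0], upper=[10.0, 20.0])
```

In an actual application the placeholder would be replaced by a score of the clustering obtained with the candidate thresholds, negated if a higher score is better, since this loop minimizes.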

Dataset    Code   Military training   Bonus points   Score range   Telephone   Provincial control line   Rejected application
data1050   200    100                 200            100           150         200                       100
data3100   600    300                 600            300           500         500                       300
data5000   1000   400                 1000           400           900         900                       400
Table: Text datasets
Comparison of clustering results
Algorithm            Dataset    Accuracy   Precision   Recall   F-Measure
Agglomerative        data1050   0.7305     0.7743      0.7969   0.7854
Agglomerative        data3100   0.7077     0.6976      0.7811   0.7370
Agglomerative        data5000   0.6808     0.6598      0.6627   0.6612
DBSCAN               data1050   0.6486     0.6795      0.7332   0.7052
DBSCAN               data3100   0.6797     0.6761      0.7880   0.7278
DBSCAN               data5000   0.6006     0.6270      0.6500   0.6643
CFSFDP               data1050   0.8171     0.8050      0.8090   0.8070
CFSFDP               data3100   0.750      0.7375      0.6617   0.6975
CFSFDP               data5000   0.7425     0.7438      0.6189   0.6756
Proposed algorithm   data1050   0.8333     0.7171      0.9098   0.8609
Proposed algorithm   data3100   0.7574     0.7421      0.7676   0.7546
Proposed algorithm   data5000   0.7712     0.7340      0.7450   0.7395
Table: Comparison of Accuracy, Precision, Recall, and F-Measure for the four algorithms
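This page does not state how the F-Measure column is computed. Assuming the usual definition, the harmonic mean of precision and recall, the short sketch below recomputes the value for two rows of the table (Agglomerative and CFSFDP on data1050). The function name `f_measure` is chosen for this example only.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the data1050 rows of the table.
f_agglomerative = f_measure(0.7743, 0.7969)
f_cfsfdp = f_measure(0.8050, 0.8090)

print(f"Agglomerative: {f_agglomerative:.4f}")
print(f"CFSFDP:        {f_cfsfdp:.4f}")
```

Because the harmonic mean is dominated by the smaller of its two inputs, a method only scores well on F-measure when precision and recall are both reasonably high.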
[1] Tan P N, Steinbach M, Kumar V. Introduction to Data Mining[M]. Addison-Wesley Professional, 1988.
[2] Sun Jigui, Liu Jie, Zhao Lianyu. Clustering Algorithms Research[J]. Journal of Software, 2008, 19(1): 48-61.
[3] Shi Mengjie. Summary of Text Clustering Algorithms[J]. Modern Computer, 2014(2): 3-6.
[4] Rodriguez A, Laio A. Clustering by Fast Search and Find of Density Peaks[J]. Science, 2014, 344(6191): 1492-1496.
doi: 10.1126/science.1242072
[5] Zhang Wenkai. Research on Density-based Hierarchical Clustering Algorithm[D]. Hefei: University of Science and Technology of China, 2015.
[6] Mehmood R, Bie R, Dawood H, et al. Fuzzy Clustering by Fast Search and Find of Density Peaks[C]//Proceedings of the 2015 International Conference on Identification, Information, and Knowledge in the Internet of Things. 2015.
[7] Ma Chunlai, Shan Hong, Ma Tao. Improved Density Peaks Based Clustering Algorithm with Strategy Choosing Cluster Center Automatically[J]. Computer Science, 2016, 43(7): 255-258.
doi: 10.11896/j.issn.1002-137X.2016.7.046
[8] Kennedy J, Eberhart R. Particle Swarm Optimization[C]//Proceedings of the 1995 IEEE International Conference on Neural Networks. 1995.
[9] Liu Jianhua. The Basic Theory of Particle Swarm Optimization and Its Improvement[D]. Changsha: Central South University, 2009.
[10] Huang Chenghui, Yin Jian, Hou Fang. A Text Similarity Measurement Combining Word Semantic Information with TF-IDF Method[J]. Chinese Journal of Computers, 2011, 34(5): 856-864.
doi: 10.3724/SP.J.1016.2011.00856
[11] Aizawa A. An Information-theoretic Perspective of TF-IDF Measures[J]. Information Processing and Management, 2003, 39(1): 45-65.
doi: 10.1016/S0306-4573(02)00021-3
[12] Salton G, Buckley C. Term-Weighting Approaches in Automatic Text Retrieval[J]. Information Processing and Management, 1988, 24(5): 513-523.
doi: 10.1016/0306-4573(88)90021-0
[13] Tan Jing. Research on Text Similarity Algorithm Based on Vector Space Model[D]. Chengdu: Southwest Petroleum University, 2015.
[14] Zhao Junjie, Hu Xuegang. Similarity Calculation Based on Text Classification[J]. Microcomputer Application, 2008, 24(12): 46-47.
doi: 10.3969/j.issn.1007-757X.2008.12.016
[15] Halkidi M, Batistakis Y, Vazirgiannis M. On Clustering Validation Techniques[J]. Journal of Intelligent Information Systems, 2001, 17(2-3): 107-145.
[16] Liang J, Bai L, Dang C, et al. The K-Means-Type Algorithms Versus Imbalanced Data Distributions[J]. IEEE Transactions on Fuzzy Systems, 2012, 20(4): 728-745.
doi: 10.1109/TFUZZ.2011.2182354
[17] Zhang Ming. Study on the Evaluation Index Symbol of Data Clustering[D]. Taiyuan: Shanxi University, 2013.
[18] Franti P, Virmajoki O, Hautamaki V. Fast Agglomerative Clustering Using a K-nearest Neighbor Graph[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2006, 28(11): 1875-1881.
doi: 10.1109/TPAMI.2006.227 pmid: 17063692
[19] Duan Mingxiu. Research and Application of Hierarchical Clustering Algorithm[D]. Changsha: Central South University, 2009.
[20] Feng Shaorong, Xiao Wenjun. An Improved DBSCAN Clustering Algorithm[J]. Journal of China University of Mining & Technology, 2008, 37(1): 106-111.