Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (12): 84-91    DOI: 10.11925/infotech.2096-3467.2017.0724
Original Article
Uncertain Data Clustering Algorithm Based on Local Density
Luo Yanfu1, Qian Xiaodong2
1School of Automation and Electrical Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
2School of Economics and Management, Lanzhou Jiaotong University, Lanzhou 730070, China
Abstract  

[Objective] This paper proposes a new algorithm for clustering uncertain data that aims to reduce the shortcomings inherited from classic algorithms. [Methods] First, we modified the uncertain distance measure and compared the probability differences between two existing uncertain objects. Then, we defined the cluster centers and proposed a new algorithm that assigns the data to their clusters based on the concepts of maximum supporting points and density chain regions. [Results] We examined the proposed algorithm on two data sets from the UCI Machine Learning Repository. The F values on the two data sets increased by 13.23% and 23.44% compared with the traditional algorithms (UK-Means and FDBSCAN). Because computing the distance matrix took longer, the overall clustering time was only slightly shorter than that of the traditional algorithms. [Limitations] There is not yet an appropriate method for setting the parameter of the proposed algorithm, and its time complexity is high. [Conclusions] The proposed algorithm can quickly determine the cluster centers and complete the clustering task. The value of t, the algorithm's only parameter, strongly influences the clustering results.
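
The center-selection idea behind clustering by local density and cut-off distance can be sketched as follows. This is a minimal illustration in the style of density-peaks clustering, not the authors' algorithm: it assumes each uncertain object is given as a list of equally likely sample points, uses the mean pairwise Euclidean distance as a stand-in for the paper's modified uncertain distance, and omits the maximum supporting points and density chain regions, which the abstract does not define. The names `expected_distance`, `density_peaks`, `cutoff`, and `n_centers` are hypothetical.

```python
import math
from itertools import product


def expected_distance(a, b):
    """Mean Euclidean distance over all sample pairs of two uncertain objects.

    Assumption: every sample of an object is equally likely.
    """
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def density_peaks(objects, cutoff, n_centers):
    """Pick cluster centers by local density and label the remaining objects."""
    n = len(objects)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = expected_distance(objects[i], objects[j])

    # Local density: number of other objects closer than the cut-off distance.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]

    # Visit objects from densest to sparsest (ties broken by index).
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n    # distance to the nearest object that ranks denser
    parent = [-1] * n    # index of that denser object (-1 for the densest)
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i], parent[i] = dist[i][j], j

    # Centers combine high density with a large distance to denser objects.
    centers = sorted(range(n), key=lambda i: rho[i] * delta[i],
                     reverse=True)[:n_centers]
    labels = [-1] * n
    for cluster_id, c in enumerate(centers):
        labels[c] = cluster_id
    for i in order:
        if labels[i] != -1:
            continue
        if parent[i] == -1:
            # Densest object was not chosen as a center: use the nearest center.
            labels[i] = labels[min(centers, key=lambda c: dist[i][c])]
        else:
            # Inherit the label of the nearest denser object (already labeled).
            labels[i] = labels[parent[i]]
    return centers, labels


# Toy data: two well-separated groups, each object described by two samples.
objects = [
    [(0.0, 0.0), (0.2, 0.0)], [(0.5, 0.0), (0.7, 0.0)], [(1.0, 0.0), (1.2, 0.0)],
    [(10.0, 0.0), (10.2, 0.0)], [(10.5, 0.0), (10.7, 0.0)], [(11.0, 0.0), (11.2, 0.0)],
]
centers, labels = density_peaks(objects, cutoff=0.75, n_centers=2)
print(centers, labels)
```

Building the full pairwise distance matrix costs quadratic time in the number of objects, which is in line with the abstract's remark that computing the distance matrix dominates the running time.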

Key words: Uncertain Data; Cut-off Distance; Local Density; Density Chain Region
Received: 24 July 2017      Published: 29 December 2017
ZTFLH:  TP393  

Cite this article:

Luo Yanfu, Qian Xiaodong. Uncertain Data Clustering Algorithm Based on Local Density. Data Analysis and Knowledge Discovery, 2017, 1(12): 84-91.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.0724     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2017/V1/I12/84

Cluster No.   Cluster center (index)   Points in cluster   Outliers
1             28                       P: 50; N: 0         2
2             92                       P: 50; N: 3
3             148                      P: 28; N: 17
Cluster No.   Cluster center (index)   Points in cluster     Outliers
1             687                      P: 4467; N: 0         0
2             829                      P: 16635; N: 42
3             29878                    P: 44473; N: 1940
Algorithm            F value (Iris)   F value (Connect-4)   Running time, s (Iris)   Running time, s (Connect-4)
UK-Means             0.8865           0.8017                0.0261                   41.2505
FDBSCAN              0.7983           0.7430                0.0442                   47.3241
Proposed algorithm   0.9854           0.9085                0.0250                   37.9223
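
The F values in the comparison table above can be turned into relative gains with a few lines of arithmetic. The sketch below assumes the gain is defined as (new - old) / old; the source does not state how the percentages in the abstract were derived, and the names `f_values`, `relative_gain`, and `gains` are hypothetical.

```python
# F values copied from the comparison table; "Proposed" is the paper's algorithm.
f_values = {
    "UK-Means": {"Iris": 0.8865, "Connect-4": 0.8017},
    "FDBSCAN": {"Iris": 0.7983, "Connect-4": 0.7430},
    "Proposed": {"Iris": 0.9854, "Connect-4": 0.9085},
}


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent (assumed definition)."""
    return (new - old) / old * 100.0


# Gain of the proposed algorithm over each baseline on each data set.
gains = {
    (baseline, dataset): round(
        relative_gain(f_values["Proposed"][dataset], f_values[baseline][dataset]), 2
    )
    for baseline in ("UK-Means", "FDBSCAN")
    for dataset in ("Iris", "Connect-4")
}

for (baseline, dataset), gain in gains.items():
    print(f"{dataset}: proposed vs {baseline}: +{gain:.2f}%")
```

Under this assumption, the gain over FDBSCAN on Iris comes out at about 23.44%, which agrees with one of the figures quoted in the abstract. The gain over UK-Means on Connect-4 comes out at about 13.32%, close to but not identical to the quoted 13.23%; the source does not say which pairs the abstract's figures refer to.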