Please wait a minute...
Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (12): 84-91    DOI: 10.11925/infotech.2096-3467.2017.0724
Orginal Article Current Issue | Archive | Adv Search |
Uncertain Data Clustering Algorithm Based on Local Density
Luo Yanfu1, Qian Xiaodong2()
1School of Automation and Electrical Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
2School of Economics and Management, Lanzhou Jiaotong University, Lanzhou 730070, China
Download: PDF (702 KB)   HTML ( 3
Export: BibTeX | EndNote (RIS)      
Abstract  

[Objective] This paper proposes a new algorithm to cluster uncertain data, aiming to reduce the shortcomings inherited from the classic ones. [Methods] First, we modified the measurement of uncertain distance and compared the probability differences between two existing uncertain objects. Then, we defined the cluster centers and proposed a new algorithm to group the data into the related clusters based on the concepts of maximum supporting points and density chain regions. [Results] We used two data sets from the UCI machine learning library to examine the proposed algorithm. We found that the F values of the two data sets increased by 13.23% and 23.44% compared to traditional algorithm (UK-Means and FDBSCAN). It took the algorithm longer time to calculate the distance matrix. Therefore, the overall clustering time was only slightly shorter than the traditional algorithm. [Limitations] There was no appropriate method to define the parameter for the proposed algorithm, and the clustering time was complex. [Conclusions] The proposed algorithm could quickly determine the clustering centers and complete the clustering tasks. The value of t (the only parameter) poses much influence to the clustering results.

Key wordsUncertain Data      Cut-off Distance      Local Density      Density Chain Region     
Received: 24 July 2017      Published: 29 December 2017
ZTFLH:  TP393  

Cite this article:

Luo Yanfu,Qian Xiaodong. Uncertain Data Clustering Algorithm Based on Local Density. Data Analysis and Knowledge Discovery, 2017, 1(12): 84-91.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.0724     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2017/V1/I12/84

簇号 聚类中心(标号) 簇内点的个数 离群点个数
1 28 P: 50; N: 0 2
2 92 P: 50; N: 3
3 148 P: 28; N: 17
簇号 聚类中心(标号) 簇内点的个数 离群点个数
1 687 P: 4467; N: 0 0
2 829 P: 16635; N: 42
3 29878 P: 44473; N: 1940
算法 F值 运行时间(s)
Iris Connect-4 Iris Connect-4
UK-Means 0.8865 0.8017 0.0261 41.2505
FDBSCAN 0.7983 0.7430 0.0442 47.3241
本文算法 0.9854 0.9085 0.0250 37.9223
[1] 李建中, 王宏志, 高宏. 大数据可用性的研究进展[J]. 软件学报, 2016, 27(7): 1605-1625.
doi: 10.13328/j.cnki.jos.005038
[1] (Li Jianzhong, Wang Hongzhi, Gao Hong.State-of-the-Art of Research on Big Data Usability[J]. Journal of Software, 2016, 27(7): 1605-1625.)
doi: 10.13328/j.cnki.jos.005038
[2] Anagnostopoulos A, Dasgupta A, Kumar R.Approximation Algorithms for Co-Clustering[C]// Proceedings of the 27th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM, 2008:201-210.
[3] Kanagal B, Deshpande A. Online Filtering, Smoothing and Probabilistic Modeling of Streaming Data[C]// Proceedings of the 24th International Conference on Data Engineering. IEEE, 2008:1160-1169.
[4] Ré C, Letchner J, Balazinksa M, et al.Event Queries on Correlated Probabilistic Streams[C]// Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, 2008:715-728.
[5] Chau M, Cheng R, Kao B, et al.Uncertain Data Mining: An Example in Clustering Location Data [A]// Advances in Knowledge Discovery and Data Mining[M]. Springer Berlin Heidelberg, 2006: 199-204.
[6] 刘位龙. 面向不确定性数据的聚类算法研究[D]. 济南: 山东师范大学, 2011.
[6] (Liu Weilong.Research on Clustering Algorithm for Uncertainty Data[D]. Ji’nan: Shandong Normal University, 2011.)
[7] Gullo F, Ponti G, Tagarelli A.Clustering Uncertain Data via K-Medoids [A]// Scalable Uncertainty Management[M]. Springer Berlin Heidelberg, 2008: 229-242.
[8] Xu H J, Li G H.Density-based Probabilistic Clustering of Uncertain Data[C]//Proceeedings of the 2008 International Conference on Computer Science and Software Engineering. 2008: 474-477.
[9] Kriegel H P, Pfeifle M.Hierarchical Density-Based Clustering of Uncertain Data[C]//Proceedings of the 5th IEEE Conference on Data Mining. 2005:689-692.
[10] Jiang B, Pei J, Tao Y, et al.Clustering Uncertain Data Based on Probability Distribution Similarity[J]. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(4): 751-763.
doi: 10.1109/TKDE.2011.221
[11] 潘冬明, 黄德才. 基于相对密度的不确定数据聚类算法[J]. 计算机科学, 2015, 42(11A): 72-74.
[11] (Pan Dongming, Huang Decai.Relative Density-based Clustering Algorithm over Uncertain Data[J]. Computer Science, 2015, 42(11A): 72-74.)
[12] Liu H, Zhang X, Zhang X, et al.Self-adapted Mixture Distance Measure for Clustering Uncertain Data[J]. Knowledge-Based Systems, 2017, 126: 33-47.
doi: 10.1016/j.knosys.2017.04.002
[13] Gullo F, Ponti G, Tagarelli A, et al.An Information-Theoretic Approach to Hierarchical Clustering of Uncertain Data[J]. Information Sciences, 2017,402:199-215.
doi: 10.1016/j.ins.2017.03.030
[14] 迟荣华, 程媛, 朱素霞, 等. 基于快速高斯变换的不确定数据聚类算法[J]. 通信学报, 2017, 38(3): 101-111.
[14] (Chi Ronghua, Cheng Yuan, Zhu Suxia, et al.Uncertain Data Analysis Algorithm Based on Fast Gaussian Transform[J]. Journal of Communications, 2017, 38(3): 101-111)
[15] Rodriguez A, Laio A. Machine Learning.Clustering by Fast Search and Find of Density Peaks[J]. Science, 2014, 344(6191): 1492-1496.
[1] Chen Jie,Ma Jing,Li Xiaofeng. Short-Text Classification Method with Text Features from Pre-trained Models[J]. 数据分析与知识发现, 2021, 5(9): 21-30.
[2] Li Wenna,Zhang Zhixiong. Research on Knowledge Base Error Detection Method Based on Confidence Learning[J]. 数据分析与知识发现, 2021, 5(9): 1-9.
[3] Sun Yu, Qiu Jiangnan. Research on Influence of Opinion Leaders Based on Network Analysis and Text Mining [J]. 数据分析与知识发现, 0, (): 1-.
[4] Wang Qinjie, Qin Chunxiu, Ma Xubu, Liu Huailiang, Xu Cunzhen. Recommending Scientific Literature Based on Author Preference and Heterogeneous Information Network[J]. 数据分析与知识发现, 2021, 5(8): 54-64.
[5] Li Wenna, Zhang Zhixiong. Entity Alignment Method for Different Knowledge Repositories with Joint Semantic Representation[J]. 数据分析与知识发现, 2021, 5(7): 1-9.
[6] Wang Hao, Lin Kerou, Meng Zhen, Li Xinlei. Identifying Multi-Type Entities in Legal Judgments with Text Representation and Feature Generation[J]. 数据分析与知识发现, 2021, 5(7): 10-25.
[7] Yang Hanxun, Zhou Dequn, Ma Jing, Luo Yongcong. Detecting Rumors with Uncertain Loss and Task-level Attention Mechanism[J]. 数据分析与知识发现, 2021, 5(7): 101-110.
[8] Xu Yuemei, Wang Zihou, Wu Zixin. Predicting Stock Trends with CNN-BiLSTM Based Multi-Feature Integration Model[J]. 数据分析与知识发现, 2021, 5(7): 126-138.
[9] Huang Mingxuan,Jiang Caoqing,Lu Shoudong. Expanding Queries Based on Word Embedding and Expansion Terms[J]. 数据分析与知识发现, 2021, 5(6): 115-125.
[10] Wang Xiwei,Jia Ruonan,Wei Yanan,Zhang Liu. Clustering User Groups of Public Opinion Events from Multi-dimensional Social Network[J]. 数据分析与知识发现, 2021, 5(6): 25-35.
[11] Ruan Xiaoyun,Liao Jianbin,Li Xiang,Yang Yang,Li Daifeng. Interpretable Recommendation of Reinforcement Learning Based on Talent Knowledge Graph Reasoning[J]. 数据分析与知识发现, 2021, 5(6): 36-50.
[12] Liu Tong,Liu Chen,Ni Weijian. A Semi-Supervised Sentiment Analysis Method for Chinese Based on Multi-Level Data Augmentation[J]. 数据分析与知识发现, 2021, 5(5): 51-58.
[13] Chen Wenjie,Wen Yi,Yang Ning. Fuzzy Overlapping Community Detection Algorithm Based on Node Vector Representation[J]. 数据分析与知识发现, 2021, 5(5): 41-50.
[14] Zhang Guobiao,Li Jie. Detecting Social Media Fake News with Semantic Consistency Between Multi-model Contents[J]. 数据分析与知识发现, 2021, 5(5): 21-29.
[15] Yan Qiang,Zhang Xiaoyan,Zhou Simin. Extracting Keywords Based on Sememe Similarity[J]. 数据分析与知识发现, 2021, 5(4): 80-89.
  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938   E-mail:jishu@mail.las.ac.cn