Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (2): 14-31     https://doi.org/10.11925/infotech.2096-3467.2020.1026
Identifying Relationship Between Pollution Sources and Cancer Cases with Spatial Ordered Pair Patterns
Xie Wang, Wang Lizhen, Chen Hongmei, Zeng Lanqing
School of Information Science and Engineering, Yunnan University, Kunming 650500, China
Abstract

[Objective] Traditional spatial co-location pattern mining, when applied to the relationship between two broad classes of features such as pollution sources and cancer cases, produces a large number of patterns that are of no interest to the user and considers only the prevalence of a pattern. This paper addresses both problems. [Methods] First, using the properties of the Voronoi diagram together with the star instance model, we define the proximity relationship between spatial instances and the concept of a spatial ordered pair pattern. Second, taking the distance attenuation effect and the influence superposition effect into account, we define the prevalence and the influence of a spatial ordered pair pattern. Finally, we propose a basic algorithm and an optimized algorithm for mining these patterns. [Results] Both proposed algorithms found results of interest to the user that traditional algorithms could not find, and they returned far fewer results than traditional algorithms. Relative to the basic algorithm, the optimized algorithm achieved a pruning rate above 80%, and the larger the dataset, the better it performed. [Limitations] All data are assumed to be point spatial objects; extended spatial objects remain for further study. [Conclusions] Spatial ordered pair patterns are better suited to studying the relationship between two broad classes of features such as pollution sources and cancer cases.

Keywords: Spatial Data Mining; Spatial Ordered Pair Pattern; Voronoi Diagram; Pollution Source; Cancer Case
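The abstract says that proximity between instances is defined from the properties of the Voronoi diagram. As a rough illustration of the underlying idea only, the sketch below assigns each point to its nearest site, which is the defining property of a Voronoi cell. It is not the paper's algorithm: the function name voronoi_assign, the roles of "sites" and "points", and the toy coordinates are all assumptions made for this example, and the star instance model is not modeled.

```python
from math import hypot

def voronoi_assign(sites, points):
    """Group points by the index of their nearest site.

    A point lies in the Voronoi cell of site i exactly when site i is
    its nearest site, so a nearest-site search reproduces the partition
    without constructing cell boundaries. Names are hypothetical.
    """
    cells = {i: [] for i in range(len(sites))}
    for p in points:
        nearest = min(
            range(len(sites)),
            key=lambda i: hypot(p[0] - sites[i][0], p[1] - sites[i][1]),
        )
        cells[nearest].append(p)
    return cells

# Toy coordinates, invented for illustration (not data from the paper).
sites = [(0.0, 0.0), (10.0, 0.0)]
points = [(1.0, 1.0), (9.0, -1.0), (4.0, 0.0)]
cells = voronoi_assign(sites, points)
for index, members in cells.items():
    print(index, members)
```

For large instance sets a spatial index would replace the linear scan; the scan is kept here so the sketch stays self-contained.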
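Received: 2020-10-21      Published: 2021-03-11
Chinese Library Classification (ZTFLH):  TP391
Funding: National Natural Science Foundation of China (61966036); National Natural Science Foundation of China (61662086); Yunnan Province Innovation Team Fund (2018HC019)
Corresponding author: Wang Lizhen     ORCID: 0000-0003-2214-2299     E-mail: lzhwang@ynu.edu.cn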
Cite this article:
Xie Wang, Wang Lizhen, Chen Hongmei, Zeng Lanqing. Identifying Relationship Between Pollution Sources and Cancer Cases with Spatial Ordered Pair Patterns[J]. Data Analysis and Knowledge Discovery, 2021, 5(2): 14-31.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2020.1026      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I2/14
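Fig.1  Example of spatial features and the distribution of their instances
Fig.2  Voronoi partition of the pollution source instance set with respect to cancer feature a
Fig.3  Voronoi partition of the pollution source instance set with respect to cancer feature b
Fig.4  Graph of the curve f(x) = (cos πx)/2 + 0.5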
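The Fig.4 caption gives only the formula of the curve. The short sketch below evaluates it at a few points to show its shape on [0, 1]: it starts at 1, passes through 0.5 at the midpoint, and falls to 0. Reading x as a normalized distance and f(x) as a distance-attenuation weight is an assumption drawn from the abstract's mention of distance attenuation, not something the caption states; the function name decay is hypothetical.

```python
import math

def decay(x):
    """Evaluate f(x) = cos(pi * x) / 2 + 0.5, the curve named in the Fig.4 caption."""
    return math.cos(math.pi * x) / 2 + 0.5

# Sample the curve on [0, 1]; the sample points are chosen for illustration.
samples = [0.0, 0.25, 0.5, 0.75, 1.0]
values = [decay(x) for x in samples]
for x, y in zip(samples, values):
    print(f"x={x:.2f}  f(x)={y:.4f}")
```

On this interval the curve decreases smoothly and is flat at both ends, so under the assumed reading nearby instances keep almost full weight and the weight vanishes gently at the far end.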
Data type               Features   Instances   Range
Cancer data             26         5,238       Longitude 102.5~105.5, latitude 25~27
Pollution source data   7          986
Table 1  Parameters of the real dataset
Fig.5  Distribution of the real dataset
Parameter   Default value
α           0.2
min_prev    0.3
min_pii     0.6
Table 2  Default parameter settings
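Fig.6  Effect of the correction coefficient on the number of candidate patterns (real dataset)
Fig.7  Effect of the correction coefficient on execution time (real dataset)
Fig.8  Effect of the minimum participation threshold on the number of candidate patterns (real dataset)
Fig.9  Effect of the minimum participation threshold on execution time (real dataset)
Fig.10  Effect of the minimum influence threshold on the number of candidate patterns (real dataset)
Fig.11  Effect of the minimum influence threshold on execution time (real dataset)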
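α                                                 0.10       0.12       0.14       0.16       0.18
Distance threshold (m)                            2,324.49   2,789.39   3,254.28   3,719.18   4,184.08
Patterns mined by Algorithm 2                     9          13         13         15         15
Patterns mined by join-less                       227        344        531        897        1,290
Patterns mined by fraction-score                  41         51         62         72         79
Meaningful patterns mined by join-less            27         40         74         94         125
Meaningful patterns mined by fraction-score       3          5          5          8          8
Patterns in common                                0          0          0          1          1
Table 3  Comparison of the mining results of Algorithm 2, the join-less algorithm, and the fraction-score algorithm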
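To make the numbers in Table 3 easier to compare, the sketch below computes, for each value of α, the share of mined patterns that the table counts as meaningful for the two baseline algorithms, and the ratio of the distance threshold to α. The values are transcribed from the table; the variable and function names are illustrative, and how "meaningful" was judged is not stated on this page.

```python
# Values transcribed from Table 3; names are illustrative only.
alphas = [0.10, 0.12, 0.14, 0.16, 0.18]
distance_m = [2324.49, 2789.39, 3254.28, 3719.18, 4184.08]
join_less_total = [227, 344, 531, 897, 1290]
join_less_meaningful = [27, 40, 74, 94, 125]
fraction_total = [41, 51, 62, 72, 79]
fraction_meaningful = [3, 5, 5, 8, 8]

def shares(meaningful, total):
    """Fraction of mined patterns counted as meaningful, per column."""
    return [m / t for m, t in zip(meaningful, total)]

join_less_share = shares(join_less_meaningful, join_less_total)
fraction_share = shares(fraction_meaningful, fraction_total)
metres_per_alpha = [d / a for d, a in zip(distance_m, alphas)]

for a, js, fs, r in zip(alphas, join_less_share, fraction_share, metres_per_alpha):
    print(f"alpha={a:.2f}  join-less={js:.1%}  "
          f"fraction-score={fs:.1%}  threshold/alpha={r:.1f} m")
```

In these five columns the meaningful share stays below 14% for both baselines, and the threshold-to-α ratio stays close to 23,244.9 m. The latter is consistent with a distance threshold that scales linearly with α, although the table alone does not establish how the threshold is derived.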
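Order   Pattern                                                                                   PI    PII
2       [{metal processing plant}, {multi-system secondary malignant tumor}]                      0.7   0.69
2       [{chemical plant}, {multi-system secondary malignant tumor}]                              0.7   0.7
3       [{metal processing plant, chemical plant}, {multi-system secondary malignant tumor}]      0.6   0.6
3       [{metal processing plant, textile mill}, {multi-system secondary malignant tumor}]        0.6   0.6
3       [{metal processing plant, power plant}, {multi-system secondary malignant tumor}]         0.6   0.6
3       [{chemical plant, textile mill}, {multi-system secondary malignant tumor}]                0.6   0.6
3       [{chemical plant, power plant}, {multi-system secondary malignant tumor}]                 0.6   0.6
3       [{textile mill, power plant}, {multi-system secondary malignant tumor}]                   0.6   0.66
Table 4  Selected mining results on the real dataset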
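Fig.12  Distribution of pollution source instances and cancer instances
Fig.13  Effect of the number of features on execution time (synthetic dataset)
Fig.14  Effect of the number of instances on execution time (synthetic dataset)
Fig.15  Effect of jointly varying the number of features and the number of instances on execution time (synthetic dataset)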