Data Analysis and Knowledge Discovery, 2017, Vol. 1, Issue 12: 92-100     https://doi.org/10.11925/infotech.2096-3467.2017.0955
Research Paper
Self-Explainable Reduction Method for Mixed Feature Data Modeling*
Jiang Siwei1,2, Xie Zhenping1,2, Chen Meijie1,2, Cai Ming3
1School of Digital Media, Jiangnan University, Wuxi 214122, China
2Jiangsu Key Laboratory of Media Design and Software Technology, Wuxi 214122, China
3Center of Informatization Development and Management, Jiangnan University, Wuxi 214122, China
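Abstract (translated from the Chinese)

[Objective] To solve the rule mining problem for datasets that mix continuous numeric features with label features. [Methods] We propose the self-explainable reduction model, a representation in which the feature dimensions of a dataset explain one another; the model partitions continuous numeric data adaptively by maximizing a newly designed self-explainable reduction objective. [Results] Experiments on standard datasets, simulated rule mining problems, and a real-world problem show that the method is clearly feasible and usable, and that it effectively extends existing data modeling and association rule mining methods. [Limitations] Computational efficiency is moderate, and the method cannot yet meet the high-speed processing demands of larger datasets. [Conclusions] Technically, the method makes up for the limitations of existing methods on mixed feature data modeling; theoretical and experimental analysis shows that it is both novel and practical.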

Abstract

[Objective] This paper aims to mine rules from data that mixes continuous numeric features with label features. [Methods] We propose a self-explainable reduction model to represent the data. The model uses a new reduction objective to create an adaptive discrete partition of each continuous data dimension. [Results] We evaluated the model on standard datasets and found that it performed better than existing methods. [Limitations] The computational efficiency of the proposed method is modest and cannot yet meet the demands of large-scale data mining. [Conclusions] The proposed model is an innovative and practical way to model mixed feature data.

Key words: Mixed Feature Data; Self-Explainable Reduction; Data Modeling; Data Mining
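The [Methods] statement in the abstract describes partitioning a continuous dimension adaptively so that it is explained by the other feature dimensions, but this page does not give the objective itself. As a rough illustration only, the sketch below uses a stand-in criterion (empirical mutual information between a two-bin split and one label feature) to pick a cut point. The function names `mutual_information` and `best_cut` and the toy data are hypothetical, and this is not the authors' self-explainable reduction objective.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # sum over observed pairs of p(x,y) * log(p(x,y) / (p(x) * p(y)))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

def best_cut(values, labels):
    """Return (cut, mi) for the two-bin split of `values` that shares the
    most information with `labels`. Stand-in criterion, not the paper's."""
    distinct = sorted(set(values))
    # candidate thresholds: midpoints between consecutive distinct values
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best = (None, -1.0)  # returned unchanged if there is only one distinct value
    for cut in candidates:
        bins = [v > cut for v in values]
        mi = mutual_information(bins, labels)
        if mi > best[1]:
            best = (cut, mi)
    return best

# Invented toy data: one continuous feature and one label feature.
values = [0.1, 0.4, 0.5, 2.1, 2.4, 2.9]
labels = ["low", "low", "low", "high", "high", "high"]
cut, mi = best_cut(values, labels)
print(cut, mi)
```

On this toy input the selected threshold falls in the gap between the two groups of values. A multi-bin or multi-feature version would need a search strategy and a penalty on the number of bins, which this sketch leaves out.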
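Received: 2017-09-22      Published: 2017-12-29
CLC number (ZTFLH):  TP393
Funding: *This work is one of the research outcomes of the National Key Technology R&D Program of China project "Technology Integration and Application Demonstration of a Cloud Service System for Film and Television Production" (Grant No. 2015BAH54F01) and the Natural Science Foundation of Jiangsu Province project "Probability-Consistency-Preserving Stream Data Reduction and Online Classification Learning" (Grant No. BK20130161).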
Cite this article:
Jiang Siwei, Xie Zhenping, Chen Meijie, Cai Ming. Self-Explainable Reduction Method for Mixed Feature Data Modeling. Data Analysis and Knowledge Discovery, 2017, 1(12): 92-100.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2017.0955      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2017/V1/I12/92
  Figure: Structure of the self-explainable reduction data representation model
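| Dataset | Continuous attributes | Label attributes | Total attributes | Instances |
|---|---|---|---|---|
| glass | 9 | 0 | 9 | 214 |
| wine-quality white | 11 | 0 | 11 | 4897 |
| wine-quality red | 11 | 0 | 11 | 1599 |
| dermatology | 1 | 33 | 34 | 366 |
| ionosphere | 32 | 2 | 34 | 351 |
| adult | 5 | 9 | 14 | 32562 |

  Table: UCI datasets used in the experiments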
| Dataset | Naïve Bayes | Proposed method + Naïve Bayes | K-means + Naïve Bayes | FCM + Naïve Bayes |
|---|---|---|---|---|
| glass | 49.53% | 59.14±4.60% | 61.92±4.12% | 62.43±3.75% |
| wine-quality white | 61.55% | 67.97±3.31% | 66.14±2.99% | 65.87±3.32% |
| wine-quality red | 55.35% | 54.00±2.03% | 58.16±0.98% | 58.38±0.95% |
| dermatology | 96.99% | 96.94±0.11% | 96.83±0.18% | 96.78±0.11% |
| ionosphere | 82.62% | 86.85±1.70% | 84.63±1.70% | 84.47±1.63% |
| adult | 88.69% | 92.47±0.82% | 89.78±0.63% | 91.34±0.80% |

  Table: Classification accuracy of the different methods
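The accuracy comparison includes a "K-means + Naïve Bayes" baseline, in which continuous attributes are discretized by clustering before a Naïve Bayes classifier is trained. This page does not describe how that baseline was configured, so the following is only a minimal sketch of what such a pipeline might look like, written with the Python standard library. The helper names (`kmeans_1d`, `discretize`, `CategoricalNB`), the choice of two clusters, the add-one smoothing, and the toy rows are all assumptions made for illustration; none of them come from the paper or from the UCI datasets.

```python
import math
import random
from collections import Counter, defaultdict

def kmeans_1d(values, k, iters=50, seed=0):
    """Lloyd's algorithm on one continuous feature; returns sorted centers."""
    rng = random.Random(seed)
    centers = rng.sample(sorted(set(values)), k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # keep the old center if a cluster received no points
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)

def discretize(v, centers):
    """Map a continuous value to the index of its nearest center."""
    return min(range(len(centers)), key=lambda j: abs(v - centers[j]))

class CategoricalNB:
    """Naive Bayes over discrete features with add-one (Laplace) smoothing."""

    def fit(self, rows, ys):
        self.classes = Counter(ys)
        self.n = len(ys)
        self.counts = defaultdict(Counter)  # (feature index, class) -> value counts
        self.levels = [set() for _ in rows[0]]
        for row, y in zip(rows, ys):
            for i, x in enumerate(row):
                self.counts[(i, y)][x] += 1
                self.levels[i].add(x)
        return self

    def predict(self, row):
        def log_posterior(y):
            lp = math.log(self.classes[y] / self.n)
            for i, x in enumerate(row):
                num = self.counts[(i, y)][x] + 1
                den = self.classes[y] + len(self.levels[i])
                lp += math.log(num / den)
            return lp
        return max(self.classes, key=log_posterior)

# Invented toy rows: (continuous value, label feature, class).
raw = [(0.2, "a", "neg"), (0.3, "b", "neg"), (0.1, "a", "neg"),
       (5.1, "a", "pos"), (4.8, "b", "pos"), (5.3, "b", "pos")]
centers = kmeans_1d([r[0] for r in raw], k=2)
rows = [(discretize(v, centers), lab) for v, lab, _ in raw]
model = CategoricalNB().fit(rows, [r[2] for r in raw])
pred = model.predict((discretize(4.9, centers), "a"))
print(centers, pred)
```

Swapping `kmeans_1d` for another discretizer while keeping the classifier fixed mirrors the structure of the comparison in the table, where only the discretization step differs between columns.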
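| Rule set | Rule complexity value |
|---|---|
| 1 | 9.5996±0.0074 |
| 2 | 16.2276±0.0158 |
| 3 | 20.8350±0.0140 |
| 4 | 29.9389±0.0211 |
| 5 | 29.6517±0.0445 |

  Table: Complexity of the simulated rule sets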
  Figure: Association diagram of simulated rule 4
  Figure: Diagram of the rules mined from the dataset corresponding to rule 4
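| Preset rule No. | Result of the proposed algorithm (cross-entropy) | Ideal result (cross-entropy) | Difference degree $\gamma$ |
|---|---|---|---|
| 1 | 0.2184±0.0710 | 0.1774±1.1532e-04 | 0.2309±0.4006 |
| 2 | 0.3947±0.0880 | 0.2640±2.1624e-04 | 0.4996±0.3260 |
| 3 | 0.2840±0.0743 | 0.2689±1.6490e-04 | 0.2617±0.1062 |
| 4 | 0.3309±0.0514 | 0.3554±1.9549e-04 | 0.1528±0.0940 |
| 5 | 0.2871±0.0325 | 0.3542±3.2526e-04 | 0.1929±0.0845 |

  Table: Results of the simulated data mining experiment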
  Figure: Distribution partition of feature F4
  Figure: Distribution partition of feature F7
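| Semantic feature | Label L1 | Label L2 | Label L3 | Label Le (no record) |
|---|---|---|---|---|
| F1 | 67.77% | 31.83% | 0.40% | / |
| F2 | 41.97% | 56.63% | / | 1.40% |
| F3 | 46.69% | 43.57% | 9.74% | / |
| F4 | 29.52% | 42.37% | 28.11% | / |
| F5 | 46.29% | 50.90% | / | 2.81% |
| F6 | 20.98% | 46.69% | 32.33% | / |
| F7 | 42.87% | 43.88% | 13.25% | / |
| F8 | 47.79% | 35.54% | / | 16.67% |
| F10 | 69.48% | 30.52% | / | / |
| F11 | 43.07% | 25.30% | / | 31.63% |
| F12 | 82.03% | 17.97% | / | / |
| F13 | 85.14% | 14.86% | / | / |

  Table: SRM-based partition results for student consumption behavior features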
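  Figure: Closed-loop association rule
  Figure: Star-shaped association rule
  Figure: Overall campus card consumption of a typical student
  Figure: Feature diagram of behavior rule 1
  Figure: Feature diagram of behavior rule 2
  Figure: Feature diagram of behavior rule 3
  Figure: Feature diagram of behavior rule 4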