数据分析与知识发现 (Data Analysis and Knowledge Discovery), 2017, Vol. 1, Issue 11: 12-18. https://doi.org/10.11925/infotech.2096-3467.2017.0544
Research Paper
Evaluating PU Learning Based on Associative Classification Algorithm
Yang Jianlin, Liu Yang
School of Information Management, Nanjing University, Nanjing 210023, China
Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China

Abstract

[Objective] This paper studies PU learning based on CBA, a widely used associative classification algorithm. [Methods] We treated a proportion α of the positive examples in the training set as unidentified positives and merged them with the negative examples to form the unlabeled set, thereby constructing a PU learning scenario. Samples were then classified against all positive class association rules, and the relative confidence of each rule was used to measure the reliability of its classification. [Results] With α set to 0, 0.3, 0.6, and 0.9, the AUC of the proposed method on the experimental datasets exceeded that of the CBA algorithm by 6.21%, 11.15%, 13.50%, and 16.56% on average, and that of the POSC4.5 algorithm by 11.27%, 15.03%, 12.22%, and 7.37% on average. [Limitations] Because we did not estimate the true proportion of positive examples among all samples and correct the confidence of the class association rules accordingly, the performance of the proposed method declines as α grows. In addition, the CBA algorithm generates many redundant rules, which we did not filter. [Conclusions] The proposed method outperforms both the CBA and POSC4.5 algorithms in PU learning scenarios.
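The PU scenario described under [Methods] — hiding a fraction α of the positive labels inside the unlabeled set — can be sketched as follows (a minimal illustration with hypothetical names, not the authors' code):

```python
import random

def make_pu_scenario(positives, negatives, alpha, seed=0):
    """Hide a fraction `alpha` of the positive examples inside the
    unlabeled set U; the remainder stay as labeled positives P."""
    rng = random.Random(seed)
    n_hidden = round(alpha * len(positives))
    hidden = set(rng.sample(range(len(positives)), n_hidden))
    labeled_p = [x for i, x in enumerate(positives) if i not in hidden]
    unlabeled = [x for i, x in enumerate(positives) if i in hidden]
    unlabeled += list(negatives)  # unlabeled set = hidden positives + all negatives
    return labeled_p, unlabeled

# 10 positives, alpha = 0.3: 3 positives are hidden among the unlabeled.
P, U = make_pu_scenario(list(range(10)), ["n1", "n2"], alpha=0.3)
print(len(P), len(U))  # 7 5
```

With α = 0 no positive is hidden, which is why that setting reduces to ordinary supervised classification in the experiments.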

Keywords: Associative Classification; PU Learning; CBA Algorithm
Received: 2017-06-12; Published: 2017-11-27
CLC number: TP311; G35
Cite this article:
Yang Jianlin, Liu Yang. Evaluating PU Learning Based on Associative Classification Algorithm. Data Analysis and Knowledge Discovery, 2017, 1(11): 12-18.
Article URL:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2017.0544 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2017/V1/I11/12
  Figure: Main steps of the proposed method (original figure not included in this text version)
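As a rough sketch of the classification step in the method above — scoring a sample by all of its matching positive class association rules — the code below illustrates one plausible shape. The paper's exact definition of relative confidence is not given in this excerpt, so the lift-style form used here (confidence divided by the positive-class prior) and all names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    antecedent: frozenset  # items that must all appear in the sample
    confidence: float      # P(positive | antecedent) on the training set

def relative_confidence(rule, positive_prior):
    # Assumed lift-style normalization: how far the rule beats the prior.
    return rule.confidence / positive_prior

def score_positive(sample, rules, positive_prior):
    """Aggregate all matching positive-class rules into one score
    (mean relative confidence of the rules that fire; 0 if none fire)."""
    fired = [relative_confidence(r, positive_prior)
             for r in rules if r.antecedent <= sample]
    return sum(fired) / len(fired) if fired else 0.0

rules = [Rule(frozenset({"a"}), 0.9), Rule(frozenset({"a", "b"}), 0.6)]
print(score_positive({"a", "b", "c"}, rules, positive_prior=0.5))  # mean of 1.8 and 1.2
```

Using all matching rules rather than only the single best one (as in classic CBA) is what the abstract describes; the aggregation function itself is a design choice.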
Dataset | Samples | Positive : Negative
adult | 48 842 | 11687 : 37155
breast | 699 | 241 : 458
cylBands | 540 | 228 : 312
hepatitis | 155 | 32 : 123
horseColic | 368 | 136 : 232
mushroom | 8 124 | 3916 : 4208
pima | 768 | 268 : 500
ticTacoe | 958 | 332 : 626
bank | 41 188 | 4640 : 36548
default | 30 000 | 6636 : 23364
  UCI experimental datasets
Dataset | Samples | Positive : Negative | Features (before binning) | Features (after binning)
adult | 48 842 | 11687 : 37155 | 15 | 97
bank | 41 188 | 4640 : 36548 | 21 | 84
breast | 699 | 241 : 458 | 11 (incl. 1 useless feature) | 20
cylBands | 540 | 228 : 312 | 40 (incl. 4 useless features) | 124
default | 30 000 | 6636 : 23364 | 24 | 132
hepatitis | 155 | 32 : 123 | 20 | 56
horseColic | 368 | 136 : 232 | 28 | 85
mushroom | 8 124 | 3916 : 4208 | 23 | 90
pima | 768 | 268 : 500 | 9 | 38
ticTacoe | 958 | 332 : 626 | 10 | 29
  Number of features in each experimental dataset before and after binning
Each cell gives CBA_AUC / POSC4.5_AUC / PU_AUC.
Dataset | α = 0 | α = 0.3 | α = 0.6 | α = 0.9
adult | 0.872 / 0.647 / 0.863 | 0.857 / 0.655 / 0.870 | 0.857 / 0.696 / 0.839 | 0.851 / 0.630 / 0.683
bank | 0.723 / 0.829 / 0.753 | 0.721 / 0.788 / 0.773 | 0.715 / 0.819 / 0.768 | 0.633 / 0.824 / 0.751
breast | 0.933 / 0.945 / 0.966 | 0.938 / 0.907 / 0.965 | 0.955 / 0.907 / 0.963 | 0.953 / 0.664 / 0.8641
cylBands | 0.747 / 0.593 / 0.820 | 0.627 / 0.5 / 0.802 | 0.602 / 0.602 / 0.757 | 0.512 / 0.547 / 0.703
default | 0.703 / 0.604 / 0.735 | 0.698 / 0.5 / 0.738 | 0.694 / 0.548 / 0.731 | 0.661 / 0.540 / 0.701
hepatitis | 0.615 / 0.742 / 0.800 | 0.600 / 0.806 / 0.790 | 0.625 / 0.5 / 0.775 | 0.554 / 0.511 / 0.747
horseColic | 0.749 / 0.790 / 0.887 | 0.701 / 0.790 / 0.811 | 0.512 / 0.790 / 0.667 | 0.503 / 0.790 / 0.595
mushroom | 1.0 / 0.998 / 0.999 | 0.999 / 0.989 / 0.996 | 0.987 / 0.946 / 0.982 | 0.876 / 0.983 / 0.781
pima | 0.751 / 0.681 / 0.760 | 0.701 / 0.711 / 0.753 | 0.668 / 0.703 / 0.702 | 0.399 / 0.594 / 0.684
ticTacoe | 1.0 / 0.719 / 0.915 | 0.810 / 0.5 / 0.905 | 0.599 / 0.5 / 0.832 | 0.534 / 0.5 / 0.635
  Classification performance of the CBA algorithm, the POSC4.5 algorithm, and the proposed method in different PU learning scenarios
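The AUC values above are the standard ranking measure from ROC analysis (Fawcett, reference [28]). A minimal rank-based computation — not the authors' implementation — looks like this:

```python
def auc(scores, labels):
    """AUC as the probability that a random positive example is ranked
    above a random negative one (ties count as half a win).
    labels: 1 = positive, 0 = negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# One of the two positives is ranked below a negative: 3 of 4 pairs correct.
print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))  # 0.75
```

A score of 0.5 corresponds to random ranking, which is why several POSC4.5 cells in the table sit at exactly 0.5.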
Improvement over CBA | α = 0 | α = 0.3 | α = 0.6 | α = 0.9
Average AUC increase | 6.21% | 11.15% | 13.50% | 16.56%
  Average improvement in AUC of the proposed method over the CBA algorithm
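As a sanity check, the α = 0 average above can be reproduced from the per-dataset results table (a quick sketch with values copied from that table, not the paper's evaluation code):

```python
# CBA_AUC and PU_AUC for alpha = 0, copied from the results table above.
cba = {"adult": 0.872, "bank": 0.723, "breast": 0.933, "cylBands": 0.747,
       "default": 0.703, "hepatitis": 0.615, "horseColic": 0.749,
       "mushroom": 1.0, "pima": 0.751, "ticTacoe": 1.0}
pu = {"adult": 0.863, "bank": 0.753, "breast": 0.966, "cylBands": 0.820,
      "default": 0.735, "hepatitis": 0.800, "horseColic": 0.887,
      "mushroom": 0.999, "pima": 0.760, "ticTacoe": 0.915}

# Relative AUC gain per dataset, then the unweighted mean over datasets.
gains = [(pu[d] - cba[d]) / cba[d] for d in cba]
avg_gain = sum(gains) / len(gains)
print(f"{avg_gain:.2%}")  # 6.21%
```

The reported averages are therefore unweighted means of per-dataset relative gains, so small datasets such as hepatitis contribute as much as adult.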
Improvement over POSC4.5 | α = 0 | α = 0.3 | α = 0.6 | α = 0.9
Average AUC increase | 11.27% | 15.03% | 12.22% | 7.37%
  Average improvement in AUC of the proposed method over the POSC4.5 algorithm
[1] Denis F. PAC Learning from Positive Statistical Queries[A]// Algorithmic Learning Theory[M]. Springer Berlin Heidelberg, 1998: 112-126.
[2] Pan Shirui, Zhang Yang, Li Xue, et al. Nearest Neighbor Algorithm for Positive and Unlabeled Learning with Uncertainty[J]. Journal of Frontiers of Computer Science and Technology, 2010, 4(9): 769-779. doi: 10.3778/j.issn.1673-9418.2010.09.001
[3] Schölkopf B, Platt J C, Shawe-Taylor J, et al. Estimating the Support of a High-dimensional Distribution[J]. Neural Computation, 2001, 13(7): 1443-1471. doi: 10.1162/089976601750264965
[4] Yu H, Han J, Chang K C C. PEBL: Positive Example Based Learning for Web Page Classification Using SVM[C]// Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2002: 239-248.
[5] He Guizhen. Bayesian Classification for Positive Unlabeled Learning with Uncertainty[D]. Xianyang: Northwest A&F University, 2012.
[6] Zhang Xing. Research on Decision Tree for Mining Uncertain Data with PU-learning[D]. Xianyang: Northwest A&F University, 2012.
[7] Hu Haoji. A Classification Method for PU Problem Based on Data Distribution and Text Similarity[D]. Shanghai: East China Normal University, 2014.
[8] Zhang Bangzuo. A Study on Learning from Positive and Unlabeled Examples[D]. Changchun: Jilin University, 2009.
[9] Liu B, Lee W S, Yu P S, et al. Partially Supervised Classification of Text Documents[C]// Proceedings of the 19th International Conference on Machine Learning. 2002.
[10] Fung G P C, Yu J X, Lu H, et al. Text Classification Without Negative Examples Revisit[J]. IEEE Transactions on Knowledge and Data Engineering, 2006, 18(1): 6-20. doi: 10.1109/TKDE.2006.16
[11] Xu Zhen. Semi-supervised Classification Based on KL Divergence[D]. Shanghai: Fudan University, 2010.
[12] Györfi L, Gyorfi Z, Vajda I. Bayesian Decision with Rejection[J]. Problems of Control and Information Theory, 1979, 8(5-6): 445-452.
[13] Chawla N V, Karakoulas G. Learning from Labeled and Unlabeled Data: An Empirical Study Across Techniques and Domains[J]. Journal of Artificial Intelligence Research, 2005, 23: 331-366. doi: 10.1613/jair.1509
[14] Jain S, White M, Radivojac P. Estimating the Class Prior and Posterior from Noisy Positives and Unlabeled Data[C]// Proceedings of the 30th Annual Conference on Neural Information Processing Systems. 2016.
[15] Natarajan N. Learning with Positive and Unlabeled Examples[D]. Austin: The University of Texas at Austin, 2015.
[16] Lee W S, Liu B. Learning with Positive and Unlabeled Examples Using Weighted Logistic Regression[C]// Proceedings of the 20th International Conference on Machine Learning. 2003.
[17] Letouzey F, Denis F, Gilleron R. Learning from Positive and Unlabeled Examples[A]// Algorithmic Learning Theory[M]. Springer Berlin Heidelberg, 2000: 71-85.
[18] De Comité F, Denis F, Gilleron R, et al. Positive and Unlabeled Examples Help Learning[A]// Algorithmic Learning Theory[M]. Springer Berlin Heidelberg, 1999: 219-230.
[19] Liu B, Dai Y, Li X, et al. Building Text Classifiers Using Positive and Unlabeled Examples[C]// Proceedings of the 3rd IEEE International Conference on Data Mining. IEEE, 2003: 179-186.
[20] Liu B, Hsu W, Ma Y. Integrating Classification and Association Rule Mining[C]// Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining. 1998.
[21] Huang Zaixiang, Zhou Zhongmei, He Tianzhong, et al. Improved Associative Classification Algorithm for Multiclass Imbalanced Datasets[J]. Pattern Recognition and Artificial Intelligence, 2015, 28(10): 922-929. doi: 10.16451/j.cnki.issn1003-6059.201510007
[22] Li Shuo. Research on Algorithm of Cost-sensitive Data Stream Classification Under PU Learning Scenario[D]. Xianyang: Northwest A&F University, 2015.
[23] Dong G, Zhang X, Wong L, et al. CAEP: Classification by Aggregating Emerging Patterns[A]// Discovery Science[M]. Springer Berlin Heidelberg, 1999: 30-42.
[24] UCI Machine Learning Repository[EB/OL]. [2017-03-26].
[25] LUCS-KDD Implementation of CBA[EB/OL]. [2017-03-26].
[26] Machine Learning Group at the University of Waikato. Weka[EB/OL]. [2017-04-12].
[27] LUCS-KDD DN Software[EB/OL]. [2017-03-26].
[28] Fawcett T. An Introduction to ROC Analysis[J]. Pattern Recognition Letters, 2006, 27(8): 861-874. doi: 10.1016/j.patrec.2005.10.010
[29] Liu Hongmei. Research of Association Rule Classification[J]. Computer Knowledge and Technology, 2009, 5(3): 535-536. doi: 10.3969/j.issn.1009-3044.2009.03.009
[30] Zaïane O, Antonie M L. On Pruning and Tuning Rules for Associative Classifiers[A]// Knowledge-based Intelligent Information and Engineering Systems[M]. Springer Berlin Heidelberg, 2005.
Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: 33 North Fourth Ring Road West, Zhongguancun, Haidian District, Beijing 100190, China
Tel/Fax: (010) 82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn