数据分析与知识发现 (Data Analysis and Knowledge Discovery), 2017, Vol. 1, Issue 11: 12-18. https://doi.org/10.11925/infotech.2096-3467.2017.0544
Research Paper
Evaluating PU Learning Based on Associative Classification Algorithm
Yang Jianlin, Liu Yang
School of Information Management, Nanjing University, Nanjing 210023, China
Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China

Abstract

[Objective] This paper studies PU learning based on CBA, a widely used associative classification algorithm. [Methods] We treated a proportion α of the positive examples in the training set as unidentified positives and merged them with the negative examples to form the unlabeled set, thereby constructing a PU learning scenario. Samples were then classified against all positive class association rules, and the relative confidence of each rule was used to measure the reliability of its classification. [Results] With α set to 0, 0.3, 0.6, and 0.9, the AUC of the proposed method on the experimental datasets exceeded that of the CBA algorithm by 6.21%, 11.15%, 13.50%, and 16.56% on average, and that of the POSC4.5 algorithm by 11.27%, 15.03%, 12.22%, and 7.37% on average. [Limitations] Because we did not estimate the true proportion of positive examples among all samples and correct the confidence of the class association rules accordingly, the performance of the proposed method declines as α grows. In addition, the CBA algorithm generates many redundant rules, which we did not filter. [Conclusions] The proposed method outperforms both the CBA and POSC4.5 algorithms in PU learning scenarios.
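The PU scenario described under [Methods] — hiding a fraction α of the positive labels inside the unlabeled set — can be sketched as follows (a minimal illustration with hypothetical names, not the authors' code):

```python
import random

def make_pu_scenario(positives, negatives, alpha, seed=0):
    """Hide a fraction `alpha` of the positive examples inside the
    unlabeled set U; the remainder stay as labeled positives P."""
    rng = random.Random(seed)
    n_hidden = round(alpha * len(positives))
    hidden = set(rng.sample(range(len(positives)), n_hidden))
    labeled_p = [x for i, x in enumerate(positives) if i not in hidden]
    unlabeled = [x for i, x in enumerate(positives) if i in hidden]
    unlabeled += list(negatives)  # unlabeled set = hidden positives + all negatives
    return labeled_p, unlabeled

# 10 positives, alpha = 0.3: 3 positives are hidden among the unlabeled.
P, U = make_pu_scenario(list(range(10)), ["n1", "n2"], alpha=0.3)
print(len(P), len(U))  # 7 5
```

With α = 0 no positive is hidden, which is why that setting reduces to ordinary supervised classification in the experiments.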

Keywords: Associative Classification; PU Learning; CBA Algorithm
Received: 2017-06-12; Published: 2017-11-27
CLC number: TP311; G35
Cite this article:
Yang Jianlin, Liu Yang. Evaluating PU Learning Based on Associative Classification Algorithm. Data Analysis and Knowledge Discovery, 2017, 1(11): 12-18.
Article URL:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2017.0544 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2017/V1/I11/12
  Figure: Main steps of the proposed method (original figure not included in this text version)
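As a rough sketch of the classification step in the method above — scoring a sample by all of its matching positive class association rules — the code below illustrates one plausible shape. The paper's exact definition of relative confidence is not given in this excerpt, so the lift-style form used here (confidence divided by the positive-class prior) and all names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    antecedent: frozenset  # items that must all appear in the sample
    confidence: float      # P(positive | antecedent) on the training set

def relative_confidence(rule, positive_prior):
    # Assumed lift-style normalization: how far the rule beats the prior.
    return rule.confidence / positive_prior

def score_positive(sample, rules, positive_prior):
    """Aggregate all matching positive-class rules into one score
    (mean relative confidence of the rules that fire; 0 if none fire)."""
    fired = [relative_confidence(r, positive_prior)
             for r in rules if r.antecedent <= sample]
    return sum(fired) / len(fired) if fired else 0.0

rules = [Rule(frozenset({"a"}), 0.9), Rule(frozenset({"a", "b"}), 0.6)]
print(score_positive({"a", "b", "c"}, rules, positive_prior=0.5))  # mean of 1.8 and 1.2
```

Using all matching rules rather than only the single best one (as in classic CBA) is what the abstract describes; the aggregation function itself is a design choice.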
Dataset | Samples | Positive : Negative
adult | 48 842 | 11687 : 37155
breast | 699 | 241 : 458
cylBands | 540 | 228 : 312
hepatitis | 155 | 32 : 123
horseColic | 368 | 136 : 232
mushroom | 8 124 | 3916 : 4208
pima | 768 | 268 : 500
ticTacoe | 958 | 332 : 626
bank | 41 188 | 4640 : 36548
default | 30 000 | 6636 : 23364
  UCI experimental datasets
Dataset | Samples | Positive : Negative | Features (before binning) | Features (after binning)
adult | 48 842 | 11687 : 37155 | 15 | 97
bank | 41 188 | 4640 : 36548 | 21 | 84
breast | 699 | 241 : 458 | 11 (incl. 1 useless feature) | 20
cylBands | 540 | 228 : 312 | 40 (incl. 4 useless features) | 124
default | 30 000 | 6636 : 23364 | 24 | 132
hepatitis | 155 | 32 : 123 | 20 | 56
horseColic | 368 | 136 : 232 | 28 | 85
mushroom | 8 124 | 3916 : 4208 | 23 | 90
pima | 768 | 268 : 500 | 9 | 38
ticTacoe | 958 | 332 : 626 | 10 | 29
  Number of features in each experimental dataset before and after binning
Each cell gives CBA_AUC / POSC4.5_AUC / PU_AUC.
Dataset | α = 0 | α = 0.3 | α = 0.6 | α = 0.9
adult | 0.872 / 0.647 / 0.863 | 0.857 / 0.655 / 0.870 | 0.857 / 0.696 / 0.839 | 0.851 / 0.630 / 0.683
bank | 0.723 / 0.829 / 0.753 | 0.721 / 0.788 / 0.773 | 0.715 / 0.819 / 0.768 | 0.633 / 0.824 / 0.751
breast | 0.933 / 0.945 / 0.966 | 0.938 / 0.907 / 0.965 | 0.955 / 0.907 / 0.963 | 0.953 / 0.664 / 0.8641
cylBands | 0.747 / 0.593 / 0.820 | 0.627 / 0.5 / 0.802 | 0.602 / 0.602 / 0.757 | 0.512 / 0.547 / 0.703
default | 0.703 / 0.604 / 0.735 | 0.698 / 0.5 / 0.738 | 0.694 / 0.548 / 0.731 | 0.661 / 0.540 / 0.701
hepatitis | 0.615 / 0.742 / 0.800 | 0.600 / 0.806 / 0.790 | 0.625 / 0.5 / 0.775 | 0.554 / 0.511 / 0.747
horseColic | 0.749 / 0.790 / 0.887 | 0.701 / 0.790 / 0.811 | 0.512 / 0.790 / 0.667 | 0.503 / 0.790 / 0.595
mushroom | 1.0 / 0.998 / 0.999 | 0.999 / 0.989 / 0.996 | 0.987 / 0.946 / 0.982 | 0.876 / 0.983 / 0.781
pima | 0.751 / 0.681 / 0.760 | 0.701 / 0.711 / 0.753 | 0.668 / 0.703 / 0.702 | 0.399 / 0.594 / 0.684
ticTacoe | 1.0 / 0.719 / 0.915 | 0.810 / 0.5 / 0.905 | 0.599 / 0.5 / 0.832 | 0.534 / 0.5 / 0.635
  Classification performance of the CBA algorithm, the POSC4.5 algorithm, and the proposed method in different PU learning scenarios
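The AUC values above are the standard ranking measure from ROC analysis (Fawcett, reference [28]). A minimal rank-based computation — not the authors' implementation — looks like this:

```python
def auc(scores, labels):
    """AUC as the probability that a random positive example is ranked
    above a random negative one (ties count as half a win).
    labels: 1 = positive, 0 = negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# One of the two positives is ranked below a negative: 3 of 4 pairs correct.
print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))  # 0.75
```

A score of 0.5 corresponds to random ranking, which is why several POSC4.5 cells in the table sit at exactly 0.5.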
Improvement over CBA | α = 0 | α = 0.3 | α = 0.6 | α = 0.9
Average AUC increase | 6.21% | 11.15% | 13.50% | 16.56%
  Average improvement in AUC of the proposed method over the CBA algorithm
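As a sanity check, the α = 0 average above can be reproduced from the per-dataset results table (a quick sketch with values copied from that table, not the paper's evaluation code):

```python
# CBA_AUC and PU_AUC for alpha = 0, copied from the results table above.
cba = {"adult": 0.872, "bank": 0.723, "breast": 0.933, "cylBands": 0.747,
       "default": 0.703, "hepatitis": 0.615, "horseColic": 0.749,
       "mushroom": 1.0, "pima": 0.751, "ticTacoe": 1.0}
pu = {"adult": 0.863, "bank": 0.753, "breast": 0.966, "cylBands": 0.820,
      "default": 0.735, "hepatitis": 0.800, "horseColic": 0.887,
      "mushroom": 0.999, "pima": 0.760, "ticTacoe": 0.915}

# Relative AUC gain per dataset, then the unweighted mean over datasets.
gains = [(pu[d] - cba[d]) / cba[d] for d in cba]
avg_gain = sum(gains) / len(gains)
print(f"{avg_gain:.2%}")  # 6.21%
```

The reported averages are therefore unweighted means of per-dataset relative gains, so small datasets such as hepatitis contribute as much as adult.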
Improvement over POSC4.5 | α = 0 | α = 0.3 | α = 0.6 | α = 0.9
Average AUC increase | 11.27% | 15.03% | 12.22% | 7.37%
  Average improvement in AUC of the proposed method over the POSC4.5 algorithm
[1] Denis F. PAC Learning from Positive Statistical Queries[A]// Algorithmic Learning Theory[M]. Springer Berlin Heidelberg, 1998: 112-126.
[2] Pan Shirui, Zhang Yang, Li Xue, et al. Nearest Neighbor Algorithm for Positive and Unlabeled Learning with Uncertainty[J]. Journal of Frontiers of Computer Science and Technology, 2010, 4(9): 769-779. doi: 10.3778/j.issn.1673-9418.2010.09.001
[3] Schölkopf B, Platt J C, Shawe-Taylor J, et al. Estimating the Support of a High-dimensional Distribution[J]. Neural Computation, 2001, 13(7): 1443-1471. doi: 10.1162/089976601750264965
[4] Yu H, Han J, Chang K C C. PEBL: Positive Example Based Learning for Web Page Classification Using SVM[C]// Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2002: 239-248.
[5] He Guizhen. Bayesian Classification for Positive Unlabeled Learning with Uncertainty[D]. Xianyang: Northwest A&F University, 2012.
[6] Zhang Xing. Research on Decision Tree for Mining Uncertain Data with PU-learning[D]. Xianyang: Northwest A&F University, 2012.
[7] Hu Haoji. A Classification Method for PU Problem Based on Data Distribution and Text Similarity[D]. Shanghai: East China Normal University, 2014.
[8] Zhang Bangzuo. A Study on Learning from Positive and Unlabeled Examples[D]. Changchun: Jilin University, 2009.
[9] Liu B, Lee W S, Yu P S, et al. Partially Supervised Classification of Text Documents[C]// Proceedings of the 19th International Conference on Machine Learning. 2002.
[10] Fung G P C, Yu J X, Lu H, et al. Text Classification Without Negative Examples Revisit[J]. IEEE Transactions on Knowledge and Data Engineering, 2006, 18(1): 6-20. doi: 10.1109/TKDE.2006.16
[11] Xu Zhen. Semi-supervised Classification Based on KL Divergence[D]. Shanghai: Fudan University, 2010.
[12] Györfi L, Gyorfi Z, Vajda I. Bayesian Decision with Rejection[J]. Problems of Control and Information Theory, 1979, 8(5-6): 445-452.
[13] Chawla N V, Karakoulas G. Learning from Labeled and Unlabeled Data: An Empirical Study Across Techniques and Domains[J]. Journal of Artificial Intelligence Research, 2005, 23: 331-366. doi: 10.1613/jair.1509
[14] Jain S, White M, Radivojac P. Estimating the Class Prior and Posterior from Noisy Positives and Unlabeled Data[C]// Proceedings of the 30th Annual Conference on Neural Information Processing Systems. 2016.
[15] Natarajan N. Learning with Positive and Unlabeled Examples[D]. Austin: The University of Texas at Austin, 2015.
[16] Lee W S, Liu B. Learning with Positive and Unlabeled Examples Using Weighted Logistic Regression[C]// Proceedings of the 20th International Conference on Machine Learning. 2003.
[17] Letouzey F, Denis F, Gilleron R. Learning from Positive and Unlabeled Examples[A]// Algorithmic Learning Theory[M]. Springer Berlin Heidelberg, 2000: 71-85.
[18] De Comité F, Denis F, Gilleron R, et al. Positive and Unlabeled Examples Help Learning[A]// Algorithmic Learning Theory[M]. Springer Berlin Heidelberg, 1999: 219-230.
[19] Liu B, Dai Y, Li X, et al. Building Text Classifiers Using Positive and Unlabeled Examples[C]// Proceedings of the 3rd IEEE International Conference on Data Mining. IEEE, 2003: 179-186.
[20] Liu B, Hsu W, Ma Y. Integrating Classification and Association Rule Mining[C]// Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining. 1998.
[21] Huang Zaixiang, Zhou Zhongmei, He Tianzhong, et al. Improved Associative Classification Algorithm for Multiclass Imbalanced Datasets[J]. Pattern Recognition and Artificial Intelligence, 2015, 28(10): 922-929. doi: 10.16451/j.cnki.issn1003-6059.201510007
[22] Li Shuo. Research on Algorithm of Cost-sensitive Data Stream Classification Under PU Learning Scenario[D]. Xianyang: Northwest A&F University, 2015.
[23] Dong G, Zhang X, Wong L, et al. CAEP: Classification by Aggregating Emerging Patterns[A]// Discovery Science[M]. Springer Berlin Heidelberg, 1999: 30-42.
[24] UCI Machine Learning Repository[EB/OL]. [2017-03-26].
[25] LUCS-KDD Implementation of CBA[EB/OL]. [2017-03-26].
[26] Machine Learning Group at the University of Waikato. Weka[EB/OL]. [2017-04-12].
[27] LUCS-KDD DN Software[EB/OL]. [2017-03-26].
[28] Fawcett T. An Introduction to ROC Analysis[J]. Pattern Recognition Letters, 2006, 27(8): 861-874. doi: 10.1016/j.patrec.2005.10.010
[29] Liu Hongmei. Research of Association Rule Classification[J]. Computer Knowledge and Technology, 2009, 5(3): 535-536. doi: 10.3969/j.issn.1009-3044.2009.03.009
[30] Zaïane O, Antonie M L. On Pruning and Tuning Rules for Associative Classifiers[A]// Knowledge-based Intelligent Information and Engineering Systems[M]. Springer Berlin Heidelberg, 2005.
Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: 33 North Fourth Ring Road West, Zhongguancun, Haidian District, Beijing 100190, China
Tel/Fax: (010) 82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn