Please wait a minute...
Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (12): 14-25    DOI: 10.11925/infotech.2096-3467.2020.0952
Current Issue | Archive | Adv Search |
k-Anonymity Algorithm of Multi-Branch-Tree Forest Based on Recognition Rate
Chen Xianlai1,2(),Luo Xiao3,Liu Li3,Li Zhongmin3,An Ying1,2
1Big Data Institute, Central South University, Changsha 410083, China
2National Engineering Laboratory for Medical Big Data Application Technology,Central South University, Changsha 410083, China
3Life Science College, Central South University, Changsha 410083, China
Download: PDF (1175 KB)   HTML ( 19
Export: BibTeX | EndNote (RIS)      
Abstract  

[Objective] This paper tries to improve the efficiency of k-anonymity algorithm and the quality of published data. [Methods] Based on the recognition rates and multi-branch-tree forest structure, we designed a new k-anonymous algorithm (MFBRR). It conducted bottom-up reviews of data according to properties of the generalization tree, and calculated the recognition rates. Then, we selected the target leaf nodes to prune the tree, which reduced the information loss. Finally, the MFBRR-γ algorithm was proposed based on parallel computing and multi-thread processing. [Results] We evaluated our algorithms with hierarchical precision and operation time using the “Adult” data sets. The hierarchical precisions of MFBRR and MFBRR-γ were 0.97 and 0.88. It took the MFBRR and MFBRR-γ algorithms 1 457 minutes and 12.08 minutes (γ=100) to process 30,000 data sets. The MFBRR algorithm achieved hierarchical precision of 0.93 with health care data. [Limitations] We only examined our models with two data sets. [Conclusions] The proposed algorithms could reduce the information loss due to anonymity and improve the quality of published data.

Key wordsk-Anonymity Algorithm      Data Quality      Data Publishing      Multi-Branch-Tree Forest      Recognition Rate     
Received: 27 September 2020      Published: 25 December 2020
ZTFLH:  TP393  
Corresponding Authors: Chen Xianlai     E-mail: chenxianlai@csu.edu.cn

Cite this article:

Chen Xianlai, Luo Xiao, Liu Li, Li Zhongmin, An Ying. k-Anonymity Algorithm of Multi-Branch-Tree Forest Based on Recognition Rate. Data Analysis and Knowledge Discovery, 2020, 4(12): 14-25.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.0952     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I12/14

ID 年龄 性别 邮编 工作时长
1 25 23200 40
2 24 23250 30
3 25 13306 35
4 21 12600 40
5 23 12650 30
6 25 13120 40
The Raw Data Sheet
ID 年龄 性别 邮编 工作时长
1 21~25 2320* 30~40
2 21~25 2320* 30~40
3 21~25 13*** 30~40
4 21~25 126** 30~40
5 21~25 126** 30~40
6 21~25 13*** 30~40
The Anonymous Data Sheet
ID A B C D
1 A1 B1 C2 D1
2 A2 B1 C1 D2
3 A1 B2 C2 D1
4 A2 B1 C3 D3
5 A1 B2 C3 D1
6 A2 B2 C3 D3
7 A3 B2 C3 D1
8 A3 B2 C4 D3
9 A4 B2 C5 D2
10 A4 B1 C4 D1
The Anonymity Strategy Sheet
The Sketch Map of Multi-Branch-Tree Forest
Processing Flow of Algorithm MFBRR
Processing Flow of Algorithm MFBRR-γ
属性 属性类型 属性值 泛化层级
Age 数值 74 6
Workclass 分类 8 3
Education 分类 16 3
Marital-status 分类 7 2
Gender 分类 2 1
Native-country 分类 41 2
Race 分类 5 2
Attribute Set of Quasi Identifier
操作系统 硬件设备 编程环境
Windows10
64位
Intel(R) Core(TM) i5-8250U CPU @1.60GHz 1.80GHz
RAM:8GB
Python3.7
Linux
Ubuntu 16.04
Intel(R) Xeon(R) Silver 4114 CPU @2.20GHz
RAM:128GB
Python3.7
Experimental Environment
Hierarchical Precision for Anonymity Algorithms
算法 分割参数γ 匿名程度k 层级准确率
MFBRR * 2 0.97
* 4 0.94
* 10 0.91
MFBRR-γ 30 2 0.88
30 4 0.80
30 10 0.71
100 2 0.84
100 4 0.73
100 10 0.63
The Comparison of Precision for Two Algorithms
数据集规模 分割参数γ k=2 k=4 k=6 k=8 k=10 k=20 k=50 k=100 k=200
50 * 0.77 0.60 0.57 0.55 0.48 0.32 * * *
100 * 0.74 0.64 0.57 0.54 0.56 0.40 0.38 * *
1 000 10 0.79 0.65 0.61 0.57 0.53 0.42 0.35 * *
1 000 * 0.89 0.80 0.77 0.73 0.72 0.65 0.54 0.50 0.33
30 000 30 0.88 0.80 0.76 0.73 0.71 0.65 0.56 0.48 0.35
30 000 100 0.84 0.73 0.68 0.65 0.63 0.55 0.43 0.32 0.11
30 000 * 0.97 0.94 0.93 0.92 0.91 0.88 0.83 0.78 0.73
Evaluation of Usability for the Algorithms
MFBRR’s Time Consumption Under Various Anonymity
MFBRR-γ’s Time Consumption Under Various Anonymity
数据集大小 分割参数γ k=2 k=4 k=6 k=8 k=10 k=20 k=50 k=100 k=200
50 * 0.11 0.14 0.13 0.13 0.13 0.14 * * *
100 * 0.45 0.61 0.76 0.74 0.51 0.52 0.56 * *
1 000 10 0.44 1.78 2.00 1.82 1.82 1.79 1.90 * *
1 000 * 13.70 26.40 33.68 41.00 46.09 51.13 62.99 50.98 52.20
30 000 30 29.30 55.90 64.90 75.20 87.90 109.50 114.00 91.40 88.60
30 000 100 12.08 19.70 22.70 24.10 24.90 27.72 23.20 25.40 32.60
30 000 * 1 457 1 875 2 066 2 352 2 582 2 926 2 767 2 665 2 534
Time Consumption Evaluation of the Algorithms
ID 性别 年龄 民族 工作类别 婚姻状况 保险类别 治疗情况 ICD代码 诊断名称
1 34 汉族 其他 未说明 省医保 其他 Z92.800 手术后状态
2 34 汉族 其他 未说明 省医保 治愈 J32.900 慢性鼻窦炎
3 34 汉族 其他 未说明 省医保 治愈 J30.4 过敏性鼻炎
4 42 汉族 农民 已婚 新农合 治愈 Z92.800 手术后状态
The Medical Data Before Anonymized (Examples)
ID 性别 年龄 民族 工作类别 婚姻状况 保险类别 治疗情况 ICD代码 诊断名称
1 27-42 汉族 * * * * Z92.800 手术后状态
2 34 汉族 其他 未说明 省医保 治愈 J32.900 慢性鼻窦炎
3 34 汉族 其他 未说明 省医保 治愈 J30.4 过敏性鼻炎
4 27-42 汉族 * * * * Z92.800 手术后状态
The Anonymized Medical Data (Examples)
性能指标 k=2 k=4 k=6 k=8 k=10 k=20 k=50 k=100 k=200
时间/s 160 233 268 420 636 1 337 2 342 2 134 2 468
层级准确率 0.93 0.89 0.87 0.84 0.83 0.74 0.61 0.48 0.37
The Performance of the Algorithm for Anonymized Medical Data Under Various Anonymity
The Precision Between Medical Data and Public Adult Data After Anonymized
[1] 牛晨晨, 周畅, 张昪 . 大数据背景下的个人隐私保护研究[J]. 西安航空学院学报, 2017,35(1):73-76.
[1] ( Niu Chenchen, Zhou Chang, Zhang Bian . Research on Personal Privacy Protection Under Big Data Background[J]. Journal of Xi’an Aeronautical University, 2017,35(1):73-76.)
[2] 王爽, 尹聪颖 . 健康医疗大数据时代的隐私保护探析[J]. 医学信息学杂志, 2019,40(1):2-5.
[2] ( Wang Shuang, Yin Congying . Discussion and Analysis on the Privacy Protection in the Age of Big Data in Healthcare[J]. Journal of Medical Informatics, 2019,40(1):2-5.)
[3] Puri V, Sachdeva S, Kaur P . Privacy Preserving Publication of Relational and Transaction Data: Survey on the Anonymization of Patient Data[J]. Computer Science Review, 2019,32:45-61.
[4] 王平水, 王建东 . 匿名化隐私保护技术研究综述[J]. 小型微型计算机系统, 2011,32(2):248-252.
[4] ( Wang Pingshui, Wang Jiandong . Survey of Research on Anonymization Privacy-Preserving Techniques[J]. Journal of Chinese Computer Systems, 2011,32(2):248-252.)
[5] Sweeney L . K-anonymity: A Model for Protecting Privacy[J]. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002,10(5):557-570.
[6] Machanavajjhala A, Kifer D, Gehrke J , et al. l-diversity: Privacy Beyond k-anonymity[J]. ACM Transactions on Knowledge Discovery from Data, 2007,1(1):1-12.
[7] Li N H, Li T C, Venkatasubramanian S. t-Closeness: Privacy Beyond k-Anonymity and l-Diversity [C]//Proceedings of IEEE 23rd International Conference on Data Engineering. 2007: 106-115.
[8] 方跃坚, 朱锦钟, 周文 , 等. 数据挖掘隐私保护算法研究综述[J]. 信息网络安全, 2017(2):6-11.
[8] ( Fang Yuejian, Zhu Jinzhong, Zhou Wen , et al. A Survey on Data Mining Privacy Protection Algorithms[J]. Netinfo Security, 2017(2):6-11.)
[9] 王静, 闫仁武, 刘亚梅 . 多敏感属性K-匿名模型的实现[J]. 计算机与数字工程, 2017,45(7):1368-1372.
[9] ( Wang Jing, Yan Renwu, Liu Yamei . Implementation of K-anonymous Model with Multi-sensitive Attributes[J]. Computer & Digital Engineering, 2017,45(7):1368-1372.)
[10] Murakami K, Uno T . Optimization Algorithm for k-anonymization of Datasets with Low Information Loss[J]. International Journal of Information Security, 2017,17(6):631-644.
[11] Fung B C M, Wang K, Yu P S. Top-Down Specialization for Information and Privacy Preservation [C]//Proceedings of the 21st International Conference on Data Engineering. 2005: 205-216.
[12] Meyerson A, Williams R. On the Complexity of Optimal K-anonymity [C]//Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. 2004: 223-228.
[13] Aggarwal G, Feder T, Kenthapadi K, et al. Anonymizing Tables [C]//Proceedings of the 10th International Conference on Database Theory. 2005: 246-258.
[14] Domingo-Ferrer J, Torra V . Ordinal, Continuous and Heterogeneous k-Anonymity Through Microaggregation[J]. Data Mining and Knowledge Discovery, 2005,11(2):195-212.
[15] Salari M, Jalili S, Mortazavi R . TBM, a Transformation Based Method for Microaggregation of Large Volume Mixed Data[J]. Data Mining and Knowledge Discovery, 2017,31(1):65-91.
[16] Monedero D R, Mezher A M, Colome X C , et al. Efficient k-anonymous Microaggregation of Multivariate Numerical Data via Principal Component Analysis[J]. Information Sciences, 2019,503:417-443.
doi: 10.1016/j.ins.2019.07.042
[17] Abidi B, Ben Y S, Perera C . Hybrid Microaggregation for Privacy Preserving Data Mining[J]. Journal of Ambient Intelligence and Humanized Computing, 2020,11(1):23-38.
[18] 国家市场监督管理总局, 国家标准化管理委员会: 金涛, 谢安明, 陈星 , 等. GB/T 37964-2019. 信息安全技术个人信息去标识化指南[S]. 2019.
[18] ( State Administration for Market Regulation, Standardization Administration: Jin Tao, Xie Anming, Chen Xing , et al. GB/T 37964-2019. Information Security Technology—Guide for De-Identifying Personal Information[S]. 2019.)
[19] Mortazavi R, Erfani S H . GRAM: An Efficient (k, l) Graph Anonymization Method[J]. Expert Systems with Applications, 2020,153:113454.
[20] Wang Q F, Zhu G, Wang C B , et al. Research on Privacy-Preserving Methods of Electronic Medical Records[J]. Journal of Physics: Conference Series, 2019,1176(2):22-29.
[21] Sweeney L. Datafly: A System for Providing Anonymity in Medical Data [C]// Proceedings of the IFIP TC11 WG11.3 11th International Conference on Database Securty XI: Status and Prospects. 1997: 356-381.
[22] 宋明秋, 王琳, 姜宝彦 , 等. 多属性泛化的K-匿名算法[J]. 电子科技大学学报, 2017,46(6):896-901.
[22] ( Song Mingqiu, Wang Lin, Jiang Baoyan , et al. K-Anonymity Algorithm Based on Multi Attribute Generalization[J]. Journal of University of Electronic Science and Technology of China, 2017,46(6):896-901.)
[23] 谷勇浩, 郭振洋, 刘威歆 . 匿名化隐私保护技术性能评估方法研究[J]. 信息安全研究, 2019,5(4):293-297.
[23] ( Gu Yonghao, Guo Zhenyang, Liu Weixin . Research on Performance Evaluation Method of Anonymization Privacy Preservation Technologies[J]. Journal of Information Security Research, 2019,5(4):293-297.)
[1] Liu Huoyu, Wang Dongbo. Research and Implementation of Data Preprocessing Oriented to Paper Similarity Detection[J]. 现代图书情报技术, 2015, 31(5): 50-56.
[2] Wang Yuefen,Zhang Chengzhi,Zhang Beibei,Wu Tingting. A Survey of Data Cleaning[J]. 现代图书情报技术, 2007, 2(12): 50-56.
[3] Shi Xiaogang,Huang Tiejun. Auto Check of Digital Book’s Content and Structure[J]. 现代图书情报技术, 2005, 21(8): 23-26.
[4] Qin Feng,Tang Xiang,Duan Yongwei. The Study and Fulfill about Criterion of Key Word in Citation Indexes[J]. 现代图书情报技术, 2004, 20(4): 87-89.
  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938   E-mail:jishu@mail.las.ac.cn