Please wait a minute...
New Technology of Library and Information Service  2010, Vol. 26 Issue (9): 56-66    DOI: 10.11925/infotech.1003-3513.2010.09.10
article Current Issue | Archive | Adv Search |
A Survey of Approximately Duplicate Data Cleaning Method
Ye Huanzhuo, Wu Di
Department of Information, School of Information and Safety Engineering, Zhongnan University of Economics and Law,Wuhan 430073, China
Download:
Export: BibTeX | EndNote (RIS)      
Abstract  

This paper introduces the steps, frameworks and metrics of approximately duplicate data cleaning. Then, the detect algorithms and the elimination algorithms are surveyed essentially,according to type and their improvement methods, and the algorithms usage scope and their advantages and disadvantages are given. Many data cleaning tools are presented, such as Merge/Purge. Finaly, it discusses the future research topics in data cleaning and points out that the concept of knowledge and semantic used in the framework of data cleaning will be an important trend.

Key wordsApproximately      duplicate      data      Data      cleaning      Detect      algorithm      Eliminate      algorithm     
Received: 12 July 2010      Published: 26 October 2010
: 

G202 TP391.1

 

Cite this article:

Ye Huanzhuo, Wu Di. A Survey of Approximately Duplicate Data Cleaning Method. New Technology of Library and Information Service, 2010, 26(9): 56-66.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2010.09.10     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2010/V26/I9/56


[1] Rahm E, Do H H. Data Cleaning: Problems and Current Approaches
[J]. IEEE Data Engineering Bulletin, 2000, 23(4): 3-13.

[2] Galhardas H, Florescu D, Shasha D, et al. Declarative Data Cleaning: Language, Model and Algorithms . In: Proceedings of the 27th International Conference on Very Large Data Bases, Rome, Italy. 2001: 371-380.

[3] Bitton D, DeWitt D J. Duplicate Record Elimination in Large Data Files
[J]. ACM Transactions on Database Systems, 1983, 8(2): 255-265.

[4] Elmagarmid A K, Ipeirotis P G, Verykios V S. Duplicate Record Detection: A Survey
[J]. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(1):1-16.

[5] Madnick S, Wang R Y. Zhang W. A Framework for Corporate Householding . In: Proceedings of the 7th International Conference on Information Quality. 2002: 36-46.

[6] 陆凤霞,王静秋,王宁生. 一种开放式的数据清理框架
[J]. 南京航空航天大学学报 ,2006,38(4): 459-463.

[7] 王曰芬,章成志,张蓓蓓,等. 数据清洗研究综述
[J]. 现代图书情报技术 ,2007(12): 50-56.

[8] 韩京宇,徐立臻,董逸生. 数据质量研究综述
[J]. 计算机科学 ,2008,35(2): 1-5, 12.

[9] Panse F, Keulen M V, Keijzer A D, et al. Duplicate Detection in Probabilistic Data . In: Proceedings of the 26th International Conference on Data Engineering Workshop, Long Beach, CA. 2010: 179-182.

[10] Batin C, Scannapieca M. Data Quality: Concepts, Methodologies and Techniques
[M]. Springer, 2006: 230-235.

[11] 曹忠升,万劲伟. 基于语义的数据清理技术
[J]. 华中科技大学学报:自然科学版 ,2005,33(2): 76-78.

[12] 夏骄雄,徐俊,吴耿锋. 数据清理中同体不同源数据的数化算法研究
[J]. 计算机工程 ,2007,33(1): 71-73.

[13] Elfeky M G, Verykios V S. On Search Enhancement of the Record Linkage Process . In: Proceedings of the KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation, Washington DC. 2003: 31-33.

[14] 陈伟,丁秋林. 可扩展数据清理平台的研究
[J]. 电子科技大学学报 ,2006,35(1): 100-103.

[15] Rehman M, Esichaikul V. Duplicate Record Detection for Database Cleansing . In: Proceedings of the 2nd International Conference on Machine Vision, Dubai, United Arab Emirates. 2009: 333-338.

[16] Wai L L, Mong L L, Tok W L. A Knowledge-based Approach for Duplicate Elimination in Data Cleaning
[J]. Information System, 2001, 26(8):585-606.

[17] Ye H Z, Wu D, Chen S. An Open Data Cleaning Framework Based on Semantic Rules for Continuous Auditing . In: Proceedings of the 2nd International Conference on Computer Engineering and Technology, Chengdu, China. 2010: 158-162.

[18] Monge A E, Elkan C P. The Field Matching Problem: Algorithms and Applications . In: Proceedings of the 2nd Conference on Knowledge Discovery and Data Mining, Portland, Oregon. 1996: 267-270.

[19] Minton S N, Nanjo C, Knoblock C A, et al. A Heterogeneous Field Matching Method for Record Linkage . In: Proceeding of the 5th IEEE International Conference on Data Mining, Houston, Texas, USA. 2005: 314-321.

[20] 陈挺,郭颖,刘云超. 中文字段匹配算法
[J]. 计算机工程 , 2003, 29(13): 118-124.

[21] Smith T F, Waterman M S. Identification of Common Molecular Subsequences
[J]. Journal of Molecular Biology, 1981, 147(1): 195-197.

[22] Nawaz Z, Al-Als Z, Bertels K, et al. Acceleration of Smith-Waterman Using Recursive Variable Expansion . In: Proceedings of the 11th EUROMICRO Conference on Digital System Design Architectures, Methods and Tools, Parma, Italy. 2008: 915-922.

[23] Liao H Y, Yin M L, Cheng Y. A Parallel Implementation of the Smith-Waterman Algorithm for Massive Sequences Searching . In: Proceedings of the 26th Annual International Conference on Engineering in Medicine and Biology Society, San Francisco, CA. 2004: 2817-2820.

[24] Su Z, Ahn Byung-Ryul, Eom Ki-yol, et al. Plagiarism Detection Using the Levenshtein Distance and Smith-Waterman Algorithm . In: Proceedings of the 3rd International Conference on Innovative Computing Information and Control, Dalian, China. 2008: 569-572.

[25] Levenshtein I V. Binary Codes Capable of Correcting Spurious Insertions and Deletions of Ones
[J]. Problems of Information Transmission, 1965(1): 8-17.

[26] Monge A E, Elkan C. An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records . In: Proceedings of the SIFMOD Workshop on Data Mining and Knowledge Discovery, Tuscan, Arizona. 1997: 23-29.

[27] Cohen W W, Ravikumar P, Fienberg S. A Comparison of String Metrics for Matching Names and Records . In: Proceedings of the Workshop on Data Cleaning and Object Consolidation at the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington DC, USA. 2003: 13-18.

[28] Liu X H, Li G L, Feng J H, et al. Effective Indices for Efficient Approximate String Search and Similarity Join . In: Proceedings of the 9th International Conference on Web-Age Information Management, Zhangjiajie, China. 2008: 127-134.

[29] Zhu M D, Shen D R, Nie T Z, et al. An Adjusted-Edit Distance Algorithm Applying to Web Environment . In: Proceedings of the 6th International Conference on Web Information Systems and Applications, Xuzhou, China. 2009: 71-75.

[30] 车万翔,刘挺,秦兵,等. 基于改进编辑距离的中文相似句子检索
[J]. 高技术通讯 ,2004(7):15-19.

[31] 刘宝艳,林鸿飞,赵晶. 基于改进编辑距离和依存文法的汉语句子相似度计算
[J]. 计算机应用与软件 ,2008,25(7): 33-34, 47.

[32] 孙铁民,于杰,尚程,等. 基于无监督学习的数据清洗算法
[J]. 吉林大学学报:信息科学版 ,2008,26(6): 599-604.

[33] 唐懿芳,钟达夫,严小卫. 基于聚类模式的数据清洗技术
[J]. 计算机应用 ,2004,24(5): 116-119.

[34] Clarke C L A, Cormack G V. Dynamic Inverted Indexes for a Distributed Full-Text Retrieval System
[J]. Technical Report, 1995, (1): 1-13.

[35] McCallum A, Nigam K, Ungar L H. Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching . In: Proceeding of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, Massachusetts, USA. 2000: 169-178.

[36] 戴东波,汤春蕾,熊贇. 基于整体和局部相似性的序列聚类算法
[J]. 软件学报 ,2010,21(4): 702-717.

[37] Ma D Y, Zhang A D. An Adaptive Density-Based Clustering Algorithm for Spatial Database with Noise . In: Proceedings of the 4th International Conference on Data Mining, Brighton, UK. 2004: 467-470.

[38] Ester M, Kriegel H P, Sander J, et al. A Density based Algorithm for Discovering Clusters in Large Spatial Database with Noise . In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, Oregon. 1996: 226-231.

[39] Huang T Q, Yu Y Q, Li K, et al. Reckon the Parameter of DBSCAN for Multi-Density Data Sets with Constraints . In: Proceedings of the International Conference on Artificial Intelligence and Computational Intelligence, Shanghai, China. 2009: 375-379.

[40] 冯少荣,肖文俊. DBSCAN聚类算法的研究与改进
[J]. 中国矿业大学学报 ,2008,37(1): 105-111.

[41] Ram A, Sharma A, Jalal A S, et al. An Enhanced Density Based Spatial Clustering of Applications with Noise . In: Proceedings of the IEEE International Conference on Advance Computing, Patiala, India. 2009: 1475-1478.

[42] 曾东海,米红,刘力丰. 一种基于网格密度与空间划分树的聚类算法
[J]. 系统工程理论与实践 ,2008,28(7): 125-131, 137.

[43] Mahran S, Mahar K. Using Grid for Accelerating Density-based Clustering . In: Proceedings of the 8th IEEE International Conference on Computer and Information Technology, Sydney, Australia. 2008: 35-40.

[44] Huang M, Bian F L. A Grid and Density Based Fast Spatial Clustering Algorithm . In: Proceedings of the International Conference on Artificial Intelligence and Computational Intelligence, Shanghai, China. 2009: 260-263.

[45] Meng L L, Ren J D, Hu C Z. CABGD: An Improved Clustering Algorithm Based on Grid-Density . In: Proceedings of the 4th International Conference on Innovative Computing, Information and Control, Kaohsiung, Taiwan. 2009: 381-384.

[46] Chen J B, Sun J, Chen Y F. A New Ant-Based Clustering Algorithm on High Dimensional Data Space
[J]. Complex Systems Concurrent Engineering, 2007(12): 605-611.

[47] Madeiro S S, Bastos-Filho C J A, Neto F B L, et al. Adaptative Clustering Particle Swarm Optimization . In: Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, Rome, Italy. 2009: 2257-2264.

[48] 陈卓,贺明霞,刘相双. 基于扩展凝聚点和网格的增量聚类算法
[J]. 哈尔滨工业大学学报 ,2006,38(8): 1382-1385, 1398.

[49] Agrawal R, Gehrke J E, Gunopulos D, et al. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications . In: Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, Washington DC, USA. 1998: 94-105.

[50] 王鹏,曾振柄,谢千河. 采用蚁群爬山法进行聚类分析的算法
[J]. 计算机工程 ,2003,29(10): 79-80.

[51] Hylton J A. Identifying and Merging Related Bibliographic Records
[M]. Cambridge, MA, USA: Massachusetts Institute of Technology , 1996: 46-50.

[52] 邱越峰,田增平,季文,等. 一种高效的检测相似重复记录的方法
[J]. 计算机学报 ,2001,24(1): 69-77.

[53] 韩京宇,徐立臻,董逸生. 一种大数据量的相似记录检测方法
[J]. 计算机研究与发展 ,2005,42(12): 2206-2212.

[54] Hernandez M A. A Generalization of Band Joins and the Merge/Purge Problem . New York: Columbia University, 1996.

[55] 郑仕辉,周傲英,张龙. XML文档的相似测度和结构索引研究
[J]. 计算机学报 ,2003,26(9): 1116-1122.

[56] 陈伟,丁秋林. 一种XML相似重复数据的清理方法研究
[J]. 北京航空航天大学学报 ,2004,30(9): 835-838.

[57] Tai K C. The Tree-to-Tree Correction Problem
[J]. Journal of the ACM, 1979, 26(3): 422-433.

[58] 谌志群. XML文档相似度计算方法研究
[J]. 情报学报 ,2009,28(1): 48-57.

[59] Aratsu T, Hirata K, Kuboyama T. Approximating Tree Edit Distance Through String Edit Distance for Binary Tree Codes
[J]. Lecture Notes in Computer Science, 2009, 54(4): 93-104.

[60] Akutsu T, Fukagawa D, Takasu A. Approximating Tree Edit Distance through String Edit Distance for Binary Tree Codes
[J]. Lecture Notes in Computer Science, 2009, 54(4): 93-104.

[61] 龚安,刘华山.基于编辑距离的XML文档结构聚类的改进算法
[J]. 微计算机应用 ,2008,29(2): 88-91.

[62] Lwin T, Nyunt T T S. An Efficient Duplicate Detection System for XML Documents . In: Proceedings of the 2nd International Conference on Computer Engineering and Applications, Bali Island, Indonesia. 2010: 178-182.

[63] Kimball R. Dealing with Dirty Data
[J]. DBMS, 1996, 9(10): 55-60.

[64] 王金铨,梁茂成,俞洪亮. 基于N-Gram和向量空间模型的语句相似度研究
[J]. 现代外语 ,2007,30(4): 405-413.

[65] Gravano L, Ipeirotis PG, Koudas N, et al. Text Joins in an RDBMS for Web Data Integration . In: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary. 2003: 90-101.

[66] Hernandez M A, Stolfo S J. Real-World Data is Dirty: Data Cleansing and the Merge/Purge Problem
[J]. Data Mining and Knowledge Discovery, 1998, 2(1): 9-37.

[67] Shahri H H, Barforush A A Z. Data Mining for Removing Fuzzy Duplicates using Fuzzy Inference . In: Proceedings of the IEEE Internal Conference on the North American Fuzzy Information Processing Society, Banff, Alberta, Canada. 2004: 419-424.

[68] Bilenko M, Mooney R J. Adaptive Duplicate Detection using Learnable String Similarity Measures . In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington DC, USA, 2003: 39-48.

[69] Ananthakrishna R, Chaudhuri S, Ganti V. Eliminating Fuzzy Duplicates in Data Warehouses . In: Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong, China. 2002: 586-597.

[70] Chaudhuri S, Ganti V, Motwani R. Robust Identification of Fuzzy Duplicates . In: Proceedings of the 21st International Conference on Data Engineering, Tokyo, Japan. 2005:865-876.

[71] 钟嘉庆,张义芳,卢志刚. 数据仓库中重复记录清理算法研究
[J]. 微型机与应用 ,2009,28(7): 4-6.

[72] Hernandez M, Stolfo S. The Merge/Purge Problem for Large Databases . In: Proceedings of the ACM SIGMOD International Conference on Management of Data, San Jose, California. 1995: 127-138.

[73] 李坚,郑宁. 对基于MPN数据清洗算法的改进
[J]. 计算机应用与软件 ,2008,25(2): 245-247.

[74] Monge A, Elkan C. An Efficient Domain Independent Algorithm for Detecting Approximately Duplicate Database Records . In: Proceedings of the SIGMOD Workshop on Data Mining and Knowledge Discovery, Tucson, Arizona. 1997: 23-29.

[75] 朱恒民,王宁生. 一种改进的相似重复记录检测方法
[J]. 控制与决策 ,2006,21(7): 805-808,813.

[76] 佘春红. 基于优先队列的增量式重复记录识别
[J]. 计算机应用 ,2003,23(9): 61-63.

[77] Kan Q, Yang Y J, Liu W H, et al. An Integrated Approach for Detecting Approximate Duplicate Records . In: Proceedings of the 2nd Asia-Pacific Conference on Computational Intelligence and Industrial Applications, Wuhan, China. 2009: 381-384.

[78] Kan Q, Yang Y J, Zhen S Q, et al. A Unified Record Linkage Strategy for Web Service Data . In: Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, Phuket, Thailand. 2010: 253-256.

[79] Broder A Z, Glassman S C, Manasse M S, et al. Syntactic Clustering of the Web
[J]. Computer Networks and ISDN Systems, 1997, 29(8):1157-1165.

[80] Li C, Lu J H, Lu Y M. Efficient Merging and Filtering Algorithms for Approximate String Searches . In: Proceedings of the IEEE International Conference on Data Engineering, Cancun, Mexico. 2008: 257-266.

[81] Cheng T Y, Wang S. A Novel Approach to Clustering Merchandise Records
[J]. Journal of Computer Science and Technology, 2007, 22(2): 228-231.

[82] 陈伟,Qiu R. 审计软件现状及发展趋势研究
[J]. 计算机科学 ,2009,36(2): 1-4, 25.

[1] Li Wenna,Zhang Zhixiong. Research on Knowledge Base Error Detection Method Based on Confidence Learning[J]. 数据分析与知识发现, 2021, 5(9): 1-9.
[2] Su Qiang, Hou Xiaoli, Zou Ni. Predicting Surgical Infections Based on Machine Learning[J]. 数据分析与知识发现, 2021, 5(8): 65-75.
[3] Gu Yaowen, Zhang Bowen, Zheng Si, Yang Fengchun, Li Jiao. Predicting Drug ADMET Properties Based on Graph Attention Network[J]. 数据分析与知识发现, 2021, 5(8): 76-85.
[4] Xu Liangchen, Guo Chonghui. Predicting Survival Rates for Gastric Cancer Based on Ensemble Learning[J]. 数据分析与知识发现, 2021, 5(8): 86-99.
[5] Lu Quan, He Chao, Chen Jing, Tian Min, Liu Ting. A Multi-Label Classification Model with Two-Stage Transfer Learning[J]. 数据分析与知识发现, 2021, 5(7): 91-100.
[6] Yang Hanxun, Zhou Dequn, Ma Jing, Luo Yongcong. Detecting Rumors with Uncertain Loss and Task-level Attention Mechanism[J]. 数据分析与知识发现, 2021, 5(7): 101-110.
[7] Dong Mei,Chang Zhijun,Zhang Runjie. A Multiple Pattern Matching Algorithm for Specifications of Incremental Metadata for Sci-Tech Literature[J]. 数据分析与知识发现, 2021, 5(6): 135-144.
[8] Yin Pengbo,Pan Weimin,Zhang Haijun,Chen Degang. Identifying Clickbait with BERT-BiGA Model[J]. 数据分析与知识发现, 2021, 5(6): 126-134.
[9] Dong Zhenheng,Lv Xueqiang,Ren Weiping,Jiang Yang,Li Guolin. Review of Key Technologies of High Performance Blockchain[J]. 数据分析与知识发现, 2021, 5(6): 14-24.
[10] Wu Xu,Chen Chunxu. Detecting Topics of Group Chats with Multiple Strategies[J]. 数据分析与知识发现, 2021, 5(5): 1-9.
[11] Lu Linong,Zhu Zhongming,Zhang Wangqiang,Wang Xiaochun. Cross-database Knowledge Integration and Fingerprint of Institutional Repositories with Lingo3G Clustering Algorithm[J]. 数据分析与知识发现, 2021, 5(5): 127-132.
[12] Liu Tong,Liu Chen,Ni Weijian. A Semi-Supervised Sentiment Analysis Method for Chinese Based on Multi-Level Data Augmentation[J]. 数据分析与知识发现, 2021, 5(5): 51-58.
[13] Ma Yingxue,Gan Mingxin,Xiao Kejun. A Matrix Factorization Recommendation Method with Tags and Contents[J]. 数据分析与知识发现, 2021, 5(5): 71-82.
[14] Zhang Guobiao,Li Jie. Detecting Social Media Fake News with Semantic Consistency Between Multi-model Contents[J]. 数据分析与知识发现, 2021, 5(5): 21-29.
[15] Wang Nan,Li Hairong,Tan Shuru. Predicting of Public Opinion Reversal with Improved SMOTE Algorithm and Ensemble Learning[J]. 数据分析与知识发现, 2021, 5(4): 37-48.
  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938   E-mail:jishu@mail.las.ac.cn