相似重复记录清理方法研究综述

doi:10.11925/infotech.1003-3513.2010.09.10

现代图书情报技术

2010, Vol. 26

Issue (9): 56-66 https://doi.org/10.11925/infotech.1003-3513.2010.09.10

情报分析与研究

本期目录 | 过刊浏览 | 高级检索

相似重复记录清理方法研究综述

叶焕倬, 吴迪

中南财经政法大学信息与安全工程学院信息系武汉 430073

A Survey of Approximately Duplicate Data Cleaning Method

Ye Huanzhuo, Wu Di

Department of Information, School of Information and Safety Engineering, Zhongnan University of Economics and Law,Wuhan 430073, China

摘要
参考文献
相关文章
Metrics

全文: PDF (784 KB) HTML
输出: BibTeX | EndNote (RIS)

摘要

介绍相似重复数据清理的步骤、框架和衡量标准。重点对检测和清除算法按照算法类型及相关改进思路进行分类综述,给出算法的适用范围和优缺点,概括现有的数据清理工具(如Merge/Purge)。对相似重复记录清理领域的研究问题进行展望,将知识和语义的概念引入到数据清理框架中是未来重要的发展趋势。

	服务

	把本文推荐给朋友
	加入引用管理器
	E-mail Alert
	RSS
	作者相关文章
	叶焕倬
	吴迪

关键词 ：相似重复记录, 数据清洗, 检测算法, 清除算法

Abstract：

This paper introduces the steps, frameworks and metrics of approximately duplicate data cleaning. Then, the detect algorithms and the elimination algorithms are surveyed essentially,according to type and their improvement methods, and the algorithms usage scope and their advantages and disadvantages are given. Many data cleaning tools are presented, such as Merge/Purge. Finaly, it discusses the future research topics in data cleaning and points out that the concept of knowledge and semantic used in the framework of data cleaning will be an important trend.

Key words： Approximately duplicate data Data cleaning Detect algorithm Eliminate algorithm

收稿日期: 2010-07-12 出版日期: 2010-10-26

G202 TP391.1

基金资助:

本文系国家自然科学基金资助项目“持续审计中智能数据处理及其应用框架研究”(项目编号:70972138)和湖北省教育厅人文社会科学基金项目“基于SOA和MAS的金融监管信息系统总体框架研究”(项目编号:2009b080)的研究成果之一。

引用本文:

叶焕倬, 吴迪. 相似重复记录清理方法研究综述[J]. 现代图书情报技术, 2010, 26(9): 56-66.
Ye Huanzhuo, Wu Di. A Survey of Approximately Duplicate Data Cleaning Method. New Technology of Library and Information Service, 2010, 26(9): 56-66.

链接本文:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2010.09.10 或 https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2010/V26/I9/56

[1] Rahm E, Do H H. Data Cleaning: Problems and Current Approaches
[J]. IEEE Data Engineering Bulletin, 2000, 23(4): 3-13.

[2] Galhardas H, Florescu D, Shasha D, et al. Declarative Data Cleaning: Language, Model and Algorithms . In: Proceedings of the 27th International Conference on Very Large Data Bases, Rome, Italy. 2001: 371-380.

[3] Bitton D, DeWitt D J. Duplicate Record Elimination in Large Data Files
[J]. ACM Transactions on Database Systems, 1983, 8(2): 255-265.

[4] Elmagarmid A K, Ipeirotis P G, Verykios V S. Duplicate Record Detection: A Survey
[J]. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(1):1-16.

[5] Madnick S, Wang R Y. Zhang W. A Framework for Corporate Householding . In: Proceedings of the 7th International Conference on Information Quality. 2002: 36-46.

[6] 陆凤霞,王静秋,王宁生. 一种开放式的数据清理框架
[J]. 南京航空航天大学学报 ,2006,38(4): 459-463.

[7] 王曰芬,章成志,张蓓蓓,等. 数据清洗研究综述
[J]. 现代图书情报技术 ,2007(12): 50-56.

[8] 韩京宇,徐立臻,董逸生. 数据质量研究综述
[J]. 计算机科学 ,2008,35(2): 1-5, 12.

[9] Panse F, Keulen M V, Keijzer A D, et al. Duplicate Detection in Probabilistic Data . In: Proceedings of the 26th International Conference on Data Engineering Workshop, Long Beach, CA. 2010: 179-182.

[10] Batin C, Scannapieca M. Data Quality: Concepts, Methodologies and Techniques
[M]. Springer, 2006: 230-235.

[11] 曹忠升,万劲伟. 基于语义的数据清理技术
[J]. 华中科技大学学报:自然科学版 ,2005,33(2): 76-78.

[12] 夏骄雄,徐俊,吴耿锋. 数据清理中同体不同源数据的数化算法研究
[J]. 计算机工程 ,2007,33(1): 71-73.

[13] Elfeky M G, Verykios V S. On Search Enhancement of the Record Linkage Process . In: Proceedings of the KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation, Washington DC. 2003: 31-33.

[14] 陈伟,丁秋林. 可扩展数据清理平台的研究
[J]. 电子科技大学学报 ,2006,35(1): 100-103.

[15] Rehman M, Esichaikul V. Duplicate Record Detection for Database Cleansing . In: Proceedings of the 2nd International Conference on Machine Vision, Dubai, United Arab Emirates. 2009: 333-338.

[16] Wai L L, Mong L L, Tok W L. A Knowledge-based Approach for Duplicate Elimination in Data Cleaning
[J]. Information System, 2001, 26(8):585-606.

[17] Ye H Z, Wu D, Chen S. An Open Data Cleaning Framework Based on Semantic Rules for Continuous Auditing . In: Proceedings of the 2nd International Conference on Computer Engineering and Technology, Chengdu, China. 2010: 158-162.

[18] Monge A E, Elkan C P. The Field Matching Problem: Algorithms and Applications . In: Proceedings of the 2nd Conference on Knowledge Discovery and Data Mining, Portland, Oregon. 1996: 267-270.

[19] Minton S N, Nanjo C, Knoblock C A, et al. A Heterogeneous Field Matching Method for Record Linkage . In: Proceeding of the 5th IEEE International Conference on Data Mining, Houston, Texas, USA. 2005: 314-321.

[20] 陈挺,郭颖,刘云超. 中文字段匹配算法
[J]. 计算机工程 , 2003, 29(13): 118-124.

[21] Smith T F, Waterman M S. Identification of Common Molecular Subsequences
[J]. Journal of Molecular Biology, 1981, 147(1): 195-197.

[22] Nawaz Z, Al-Als Z, Bertels K, et al. Acceleration of Smith-Waterman Using Recursive Variable Expansion . In: Proceedings of the 11th EUROMICRO Conference on Digital System Design Architectures, Methods and Tools, Parma, Italy. 2008: 915-922.

[23] Liao H Y, Yin M L, Cheng Y. A Parallel Implementation of the Smith-Waterman Algorithm for Massive Sequences Searching . In: Proceedings of the 26th Annual International Conference on Engineering in Medicine and Biology Society, San Francisco, CA. 2004: 2817-2820.

[24] Su Z, Ahn Byung-Ryul, Eom Ki-yol, et al. Plagiarism Detection Using the Levenshtein Distance and Smith-Waterman Algorithm . In: Proceedings of the 3rd International Conference on Innovative Computing Information and Control, Dalian, China. 2008: 569-572.

[25] Levenshtein I V. Binary Codes Capable of Correcting Spurious Insertions and Deletions of Ones
[J]. Problems of Information Transmission, 1965(1): 8-17.

[26] Monge A E, Elkan C. An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records . In: Proceedings of the SIFMOD Workshop on Data Mining and Knowledge Discovery, Tuscan, Arizona. 1997: 23-29.

[27] Cohen W W, Ravikumar P, Fienberg S. A Comparison of String Metrics for Matching Names and Records . In: Proceedings of the Workshop on Data Cleaning and Object Consolidation at the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington DC, USA. 2003: 13-18.

[28] Liu X H, Li G L, Feng J H, et al. Effective Indices for Efficient Approximate String Search and Similarity Join . In: Proceedings of the 9th International Conference on Web-Age Information Management, Zhangjiajie, China. 2008: 127-134.

[29] Zhu M D, Shen D R, Nie T Z, et al. An Adjusted-Edit Distance Algorithm Applying to Web Environment . In: Proceedings of the 6th International Conference on Web Information Systems and Applications, Xuzhou, China. 2009: 71-75.

[30] 车万翔,刘挺,秦兵,等. 基于改进编辑距离的中文相似句子检索
[J]. 高技术通讯 ,2004(7):15-19.

[31] 刘宝艳,林鸿飞,赵晶. 基于改进编辑距离和依存文法的汉语句子相似度计算
[J]. 计算机应用与软件 ,2008,25(7): 33-34, 47.

[32] 孙铁民,于杰,尚程,等. 基于无监督学习的数据清洗算法
[J]. 吉林大学学报:信息科学版 ,2008,26(6): 599-604.

[33] 唐懿芳,钟达夫,严小卫. 基于聚类模式的数据清洗技术
[J]. 计算机应用 ,2004,24(5): 116-119.

[34] Clarke C L A, Cormack G V. Dynamic Inverted Indexes for a Distributed Full-Text Retrieval System
[J]. Technical Report, 1995, (1): 1-13.

[35] McCallum A, Nigam K, Ungar L H. Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching . In: Proceeding of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, Massachusetts, USA. 2000: 169-178.

[36] 戴东波,汤春蕾,熊贇. 基于整体和局部相似性的序列聚类算法
[J]. 软件学报 ,2010,21(4): 702-717.

[37] Ma D Y, Zhang A D. An Adaptive Density-Based Clustering Algorithm for Spatial Database with Noise . In: Proceedings of the 4th International Conference on Data Mining, Brighton, UK. 2004: 467-470.

[38] Ester M, Kriegel H P, Sander J, et al. A Density based Algorithm for Discovering Clusters in Large Spatial Database with Noise . In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, Oregon. 1996: 226-231.

[39] Huang T Q, Yu Y Q, Li K, et al. Reckon the Parameter of DBSCAN for Multi-Density Data Sets with Constraints . In: Proceedings of the International Conference on Artificial Intelligence and Computational Intelligence, Shanghai, China. 2009: 375-379.

[40] 冯少荣,肖文俊. DBSCAN聚类算法的研究与改进
[J]. 中国矿业大学学报 ,2008,37(1): 105-111.

[41] Ram A, Sharma A, Jalal A S, et al. An Enhanced Density Based Spatial Clustering of Applications with Noise . In: Proceedings of the IEEE International Conference on Advance Computing, Patiala, India. 2009: 1475-1478.

[42] 曾东海,米红,刘力丰. 一种基于网格密度与空间划分树的聚类算法
[J]. 系统工程理论与实践 ,2008,28(7): 125-131, 137.

[43] Mahran S, Mahar K. Using Grid for Accelerating Density-based Clustering . In: Proceedings of the 8th IEEE International Conference on Computer and Information Technology, Sydney, Australia. 2008: 35-40.

[44] Huang M, Bian F L. A Grid and Density Based Fast Spatial Clustering Algorithm . In: Proceedings of the International Conference on Artificial Intelligence and Computational Intelligence, Shanghai, China. 2009: 260-263.

[45] Meng L L, Ren J D, Hu C Z. CABGD: An Improved Clustering Algorithm Based on Grid-Density . In: Proceedings of the 4th International Conference on Innovative Computing, Information and Control, Kaohsiung, Taiwan. 2009: 381-384.

[46] Chen J B, Sun J, Chen Y F. A New Ant-Based Clustering Algorithm on High Dimensional Data Space
[J]. Complex Systems Concurrent Engineering, 2007(12): 605-611.

[47] Madeiro S S, Bastos-Filho C J A, Neto F B L, et al. Adaptative Clustering Particle Swarm Optimization . In: Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, Rome, Italy. 2009: 2257-2264.

[48] 陈卓,贺明霞,刘相双. 基于扩展凝聚点和网格的增量聚类算法
[J]. 哈尔滨工业大学学报 ,2006,38(8): 1382-1385, 1398.

[49] Agrawal R, Gehrke J E, Gunopulos D, et al. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications . In: Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, Washington DC, USA. 1998: 94-105.

[50] 王鹏,曾振柄,谢千河. 采用蚁群爬山法进行聚类分析的算法
[J]. 计算机工程 ,2003,29(10): 79-80.

[51] Hylton J A. Identifying and Merging Related Bibliographic Records
[M]. Cambridge, MA, USA: Massachusetts Institute of Technology , 1996: 46-50.

[52] 邱越峰,田增平,季文,等. 一种高效的检测相似重复记录的方法
[J]. 计算机学报 ,2001,24(1): 69-77.

[53] 韩京宇,徐立臻,董逸生. 一种大数据量的相似记录检测方法
[J]. 计算机研究与发展 ,2005,42(12): 2206-2212.

[54] Hernandez M A. A Generalization of Band Joins and the Merge/Purge Problem . New York: Columbia University, 1996.

[55] 郑仕辉,周傲英,张龙. XML文档的相似测度和结构索引研究
[J]. 计算机学报 ,2003,26(9): 1116-1122.

[56] 陈伟,丁秋林. 一种XML相似重复数据的清理方法研究
[J]. 北京航空航天大学学报 ,2004,30(9): 835-838.

[57] Tai K C. The Tree-to-Tree Correction Problem
[J]. Journal of the ACM, 1979, 26(3): 422-433.

[58] 谌志群. XML文档相似度计算方法研究
[J]. 情报学报 ,2009,28(1): 48-57.

[59] Aratsu T, Hirata K, Kuboyama T. Approximating Tree Edit Distance Through String Edit Distance for Binary Tree Codes
[J]. Lecture Notes in Computer Science, 2009, 54(4): 93-104.

[60] Akutsu T, Fukagawa D, Takasu A. Approximating Tree Edit Distance through String Edit Distance for Binary Tree Codes
[J]. Lecture Notes in Computer Science, 2009, 54(4): 93-104.

[61] 龚安,刘华山.基于编辑距离的XML文档结构聚类的改进算法
[J]. 微计算机应用 ,2008,29(2): 88-91.

[62] Lwin T, Nyunt T T S. An Efficient Duplicate Detection System for XML Documents . In: Proceedings of the 2nd International Conference on Computer Engineering and Applications, Bali Island, Indonesia. 2010: 178-182.

[63] Kimball R. Dealing with Dirty Data
[J]. DBMS, 1996, 9(10): 55-60.

[64] 王金铨,梁茂成,俞洪亮. 基于N-Gram和向量空间模型的语句相似度研究
[J]. 现代外语 ,2007,30(4): 405-413.

[65] Gravano L, Ipeirotis PG, Koudas N, et al. Text Joins in an RDBMS for Web Data Integration . In: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary. 2003: 90-101.

[66] Hernandez M A, Stolfo S J. Real-World Data is Dirty: Data Cleansing and the Merge/Purge Problem
[J]. Data Mining and Knowledge Discovery, 1998, 2(1): 9-37.

[67] Shahri H H, Barforush A A Z. Data Mining for Removing Fuzzy Duplicates using Fuzzy Inference . In: Proceedings of the IEEE Internal Conference on the North American Fuzzy Information Processing Society, Banff, Alberta, Canada. 2004: 419-424.

[68] Bilenko M, Mooney R J. Adaptive Duplicate Detection using Learnable String Similarity Measures . In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington DC, USA, 2003: 39-48.

[69] Ananthakrishna R, Chaudhuri S, Ganti V. Eliminating Fuzzy Duplicates in Data Warehouses . In: Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong, China. 2002: 586-597.

[70] Chaudhuri S, Ganti V, Motwani R. Robust Identification of Fuzzy Duplicates . In: Proceedings of the 21st International Conference on Data Engineering, Tokyo, Japan. 2005:865-876.

[71] 钟嘉庆,张义芳,卢志刚. 数据仓库中重复记录清理算法研究
[J]. 微型机与应用 ,2009,28(7): 4-6.

[72] Hernandez M, Stolfo S. The Merge/Purge Problem for Large Databases . In: Proceedings of the ACM SIGMOD International Conference on Management of Data, San Jose, California. 1995: 127-138.

[73] 李坚,郑宁. 对基于MPN数据清洗算法的改进
[J]. 计算机应用与软件 ,2008,25(2): 245-247.

[74] Monge A, Elkan C. An Efficient Domain Independent Algorithm for Detecting Approximately Duplicate Database Records . In: Proceedings of the SIGMOD Workshop on Data Mining and Knowledge Discovery, Tucson, Arizona. 1997: 23-29.

[75] 朱恒民,王宁生. 一种改进的相似重复记录检测方法
[J]. 控制与决策 ,2006,21(7): 805-808,813.

[76] 佘春红. 基于优先队列的增量式重复记录识别
[J]. 计算机应用 ,2003,23(9): 61-63.

[77] Kan Q, Yang Y J, Liu W H, et al. An Integrated Approach for Detecting Approximate Duplicate Records . In: Proceedings of the 2nd Asia-Pacific Conference on Computational Intelligence and Industrial Applications, Wuhan, China. 2009: 381-384.

[78] Kan Q, Yang Y J, Zhen S Q, et al. A Unified Record Linkage Strategy for Web Service Data . In: Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, Phuket, Thailand. 2010: 253-256.

[79] Broder A Z, Glassman S C, Manasse M S, et al. Syntactic Clustering of the Web
[J]. Computer Networks and ISDN Systems, 1997, 29(8):1157-1165.

[80] Li C, Lu J H, Lu Y M. Efficient Merging and Filtering Algorithms for Approximate String Searches . In: Proceedings of the IEEE International Conference on Data Engineering, Cancun, Mexico. 2008: 257-266.

[81] Cheng T Y, Wang S. A Novel Approach to Clustering Merchandise Records
[J]. Journal of Computer Science and Technology, 2007, 22(2): 228-231.

[82] 陈伟,Qiu R. 审计软件现状及发展趋势研究
[J]. 计算机科学 ,2009,36(2): 1-4, 25.

[1]	刘伙玉, 王东波. 面向论文相似性检测的数据预处理研究[J]. 现代图书情报技术, 2015, 31(5): 50-56.
[2]	叶焕倬, 吴迪. 基于改进编辑距离的相似重复记录清理算法[J]. 现代图书情报技术, 2011, 27(7/8): 82-90.
[3]	雷孝平, 张旭, 赵蕴华, 郑佳. 基于IRPU算法的专利数据相似重复属性及记录检测方法[J]. 现代图书情报技术, 2010, 26(12): 46-51.
[4]	邵增荣,李英,范体军. 正则表达式在油价事件网页提取中的应用*[J]. 现代图书情报技术, 2009, 3(2): 83-88.
[5]	黄永文,李广建. 数字图书馆中的ETL应用研究综述[J]. 现代图书情报技术, 2007, 2(12): 1-5.
[6]	王曰芬,章成志,张蓓蓓,吴婷婷. 数据清洗研究综述[J]. 现代图书情报技术, 2007, 2(12): 50-56.

Viewed

Full text

Abstract

Cited

Shared

Discussed