New Technology of Library and Information Service  2007, Vol. 2 Issue (12): 50-56    DOI: 10.11925/infotech.1003-3513.2007.12.11
A Survey of Data Cleaning
Wang Yuefen1,2  Zhang Chengzhi1,2,3  Zhang Beibei1,2  Wu Tingting1,2
1 (Department of Information Management, Nanjing University of Science & Technology, Nanjing 210094, China)
2 (Laboratory for Enterprise Innovation Service, Nanjing University of Science & Technology, Nanjing 210094, China)
3 (Institute of Scientific & Technical Information of China, Beijing 100038, China)
Abstract  

This paper surveys the data cleaning problem. First, it explains the background of the problem and the current state of research. Next, it defines data cleaning and identifies the objects it applies to. It then presents the basic principles and several models of data cleaning, analyzes related algorithms and tools, and proposes methods for evaluating data cleaning. Finally, it discusses future research topics and applications related to data cleaning.

Key words: Data cleaning; Data quality; Duplicate record detection; Outlier detection
Received: 17 September 2007      Published: 25 December 2007
CLC number: G350
Corresponding Author: Wang Yuefen     E-mail: yuefen163@vip.163.com
About authors: Wang Yuefen, Zhang Chengzhi, Zhang Beibei, Wu Tingting

Cite this article:

Wang Yuefen, Zhang Chengzhi, Zhang Beibei, Wu Tingting. A Survey of Data Cleaning. New Technology of Library and Information Service, 2007, 2(12): 50-56.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2007.12.11     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2007/V2/I12/50

