[1] Rahm E, Do H H. Data Cleaning: Problems and Current Approaches
[J]. IEEE Data Engineering Bulletin, 2000, 23(4): 3-13.
[2] Galhardas H, Florescu D, Shasha D, et al. Declarative Data Cleaning: Language, Model and Algorithms . In: Proceedings of the 27th International Conference on Very Large Data Bases, Rome, Italy. 2001: 371-380.
[3] Bitton D, DeWitt D J. Duplicate Record Elimination in Large Data Files
[J]. ACM Transactions on Database Systems, 1983, 8(2): 255-265.
[4] Elmagarmid A K, Ipeirotis P G, Verykios V S. Duplicate Record Detection: A Survey
[J]. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(1):1-16.
[5] Madnick S, Wang R Y. Zhang W. A Framework for Corporate Householding . In: Proceedings of the 7th International Conference on Information Quality. 2002: 36-46.
[6] 陆凤霞,王静秋,王宁生. 一种开放式的数据清理框架
[J]. 南京航空航天大学学报 ,2006,38(4): 459-463.
[7] 王曰芬,章成志,张蓓蓓,等. 数据清洗研究综述
[J]. 现代图书情报技术 ,2007(12): 50-56.
[8] 韩京宇,徐立臻,董逸生. 数据质量研究综述
[J]. 计算机科学 ,2008,35(2): 1-5, 12.
[9] Panse F, Keulen M V, Keijzer A D, et al. Duplicate Detection in Probabilistic Data . In: Proceedings of the 26th International Conference on Data Engineering Workshop, Long Beach, CA. 2010: 179-182.
[10] Batin C, Scannapieca M. Data Quality: Concepts, Methodologies and Techniques
[M]. Springer, 2006: 230-235.
[11] 曹忠升,万劲伟. 基于语义的数据清理技术
[J]. 华中科技大学学报:自然科学版 ,2005,33(2): 76-78.
[12] 夏骄雄,徐俊,吴耿锋. 数据清理中同体不同源数据的数化算法研究
[J]. 计算机工程 ,2007,33(1): 71-73.
[13] Elfeky M G, Verykios V S. On Search Enhancement of the Record Linkage Process . In: Proceedings of the KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation, Washington DC. 2003: 31-33.
[14] 陈伟,丁秋林. 可扩展数据清理平台的研究
[J]. 电子科技大学学报 ,2006,35(1): 100-103.
[15] Rehman M, Esichaikul V. Duplicate Record Detection for Database Cleansing . In: Proceedings of the 2nd International Conference on Machine Vision, Dubai, United Arab Emirates. 2009: 333-338.
[16] Wai L L, Mong L L, Tok W L. A Knowledge-based Approach for Duplicate Elimination in Data Cleaning
[J]. Information System, 2001, 26(8):585-606.
[17] Ye H Z, Wu D, Chen S. An Open Data Cleaning Framework Based on Semantic Rules for Continuous Auditing . In: Proceedings of the 2nd International Conference on Computer Engineering and Technology, Chengdu, China. 2010: 158-162.
[18] Monge A E, Elkan C P. The Field Matching Problem: Algorithms and Applications . In: Proceedings of the 2nd Conference on Knowledge Discovery and Data Mining, Portland, Oregon. 1996: 267-270.
[19] Minton S N, Nanjo C, Knoblock C A, et al. A Heterogeneous Field Matching Method for Record Linkage . In: Proceeding of the 5th IEEE International Conference on Data Mining, Houston, Texas, USA. 2005: 314-321.
[20] 陈挺,郭颖,刘云超. 中文字段匹配算法
[J]. 计算机工程 , 2003, 29(13): 118-124.
[21] Smith T F, Waterman M S. Identification of Common Molecular Subsequences
[J]. Journal of Molecular Biology, 1981, 147(1): 195-197.
[22] Nawaz Z, Al-Als Z, Bertels K, et al. Acceleration of Smith-Waterman Using Recursive Variable Expansion . In: Proceedings of the 11th EUROMICRO Conference on Digital System Design Architectures, Methods and Tools, Parma, Italy. 2008: 915-922.
[23] Liao H Y, Yin M L, Cheng Y. A Parallel Implementation of the Smith-Waterman Algorithm for Massive Sequences Searching . In: Proceedings of the 26th Annual International Conference on Engineering in Medicine and Biology Society, San Francisco, CA. 2004: 2817-2820.
[24] Su Z, Ahn Byung-Ryul, Eom Ki-yol, et al. Plagiarism Detection Using the Levenshtein Distance and Smith-Waterman Algorithm . In: Proceedings of the 3rd International Conference on Innovative Computing Information and Control, Dalian, China. 2008: 569-572.
[25] Levenshtein I V. Binary Codes Capable of Correcting Spurious Insertions and Deletions of Ones
[J]. Problems of Information Transmission, 1965(1): 8-17.
[26] Monge A E, Elkan C. An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records . In: Proceedings of the SIFMOD Workshop on Data Mining and Knowledge Discovery, Tuscan, Arizona. 1997: 23-29.
[27] Cohen W W, Ravikumar P, Fienberg S. A Comparison of String Metrics for Matching Names and Records . In: Proceedings of the Workshop on Data Cleaning and Object Consolidation at the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington DC, USA. 2003: 13-18.
[28] Liu X H, Li G L, Feng J H, et al. Effective Indices for Efficient Approximate String Search and Similarity Join . In: Proceedings of the 9th International Conference on Web-Age Information Management, Zhangjiajie, China. 2008: 127-134.
[29] Zhu M D, Shen D R, Nie T Z, et al. An Adjusted-Edit Distance Algorithm Applying to Web Environment . In: Proceedings of the 6th International Conference on Web Information Systems and Applications, Xuzhou, China. 2009: 71-75.
[30] 车万翔,刘挺,秦兵,等. 基于改进编辑距离的中文相似句子检索
[J]. 高技术通讯 ,2004(7):15-19.
[31] 刘宝艳,林鸿飞,赵晶. 基于改进编辑距离和依存文法的汉语句子相似度计算
[J]. 计算机应用与软件 ,2008,25(7): 33-34, 47.
[32] 孙铁民,于杰,尚程,等. 基于无监督学习的数据清洗算法
[J]. 吉林大学学报:信息科学版 ,2008,26(6): 599-604.
[33] 唐懿芳,钟达夫,严小卫. 基于聚类模式的数据清洗技术
[J]. 计算机应用 ,2004,24(5): 116-119.
[34] Clarke C L A, Cormack G V. Dynamic Inverted Indexes for a Distributed Full-Text Retrieval System
[J]. Technical Report, 1995, (1): 1-13.
[35] McCallum A, Nigam K, Ungar L H. Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching . In: Proceeding of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, Massachusetts, USA. 2000: 169-178.
[36] 戴东波,汤春蕾,熊贇. 基于整体和局部相似性的序列聚类算法
[J]. 软件学报 ,2010,21(4): 702-717.
[37] Ma D Y, Zhang A D. An Adaptive Density-Based Clustering Algorithm for Spatial Database with Noise . In: Proceedings of the 4th International Conference on Data Mining, Brighton, UK. 2004: 467-470.
[38] Ester M, Kriegel H P, Sander J, et al. A Density based Algorithm for Discovering Clusters in Large Spatial Database with Noise . In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, Oregon. 1996: 226-231.
[39] Huang T Q, Yu Y Q, Li K, et al. Reckon the Parameter of DBSCAN for Multi-Density Data Sets with Constraints . In: Proceedings of the International Conference on Artificial Intelligence and Computational Intelligence, Shanghai, China. 2009: 375-379.
[40] 冯少荣,肖文俊. DBSCAN聚类算法的研究与改进
[J]. 中国矿业大学学报 ,2008,37(1): 105-111.
[41] Ram A, Sharma A, Jalal A S, et al. An Enhanced Density Based Spatial Clustering of Applications with Noise . In: Proceedings of the IEEE International Conference on Advance Computing, Patiala, India. 2009: 1475-1478.
[42] 曾东海,米红,刘力丰. 一种基于网格密度与空间划分树的聚类算法
[J]. 系统工程理论与实践 ,2008,28(7): 125-131, 137.
[43] Mahran S, Mahar K. Using Grid for Accelerating Density-based Clustering . In: Proceedings of the 8th IEEE International Conference on Computer and Information Technology, Sydney, Australia. 2008: 35-40.
[44] Huang M, Bian F L. A Grid and Density Based Fast Spatial Clustering Algorithm . In: Proceedings of the International Conference on Artificial Intelligence and Computational Intelligence, Shanghai, China. 2009: 260-263.
[45] Meng L L, Ren J D, Hu C Z. CABGD: An Improved Clustering Algorithm Based on Grid-Density . In: Proceedings of the 4th International Conference on Innovative Computing, Information and Control, Kaohsiung, Taiwan. 2009: 381-384.
[46] Chen J B, Sun J, Chen Y F. A New Ant-Based Clustering Algorithm on High Dimensional Data Space
[J]. Complex Systems Concurrent Engineering, 2007(12): 605-611.
[47] Madeiro S S, Bastos-Filho C J A, Neto F B L, et al. Adaptative Clustering Particle Swarm Optimization . In: Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, Rome, Italy. 2009: 2257-2264.
[48] 陈卓,贺明霞,刘相双. 基于扩展凝聚点和网格的增量聚类算法
[J]. 哈尔滨工业大学学报 ,2006,38(8): 1382-1385, 1398.
[49] Agrawal R, Gehrke J E, Gunopulos D, et al. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications . In: Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, Washington DC, USA. 1998: 94-105.
[50] 王鹏,曾振柄,谢千河. 采用蚁群爬山法进行聚类分析的算法
[J]. 计算机工程 ,2003,29(10): 79-80.
[51] Hylton J A. Identifying and Merging Related Bibliographic Records
[M]. Cambridge, MA, USA: Massachusetts Institute of Technology , 1996: 46-50.
[52] 邱越峰,田增平,季文,等. 一种高效的检测相似重复记录的方法
[J]. 计算机学报 ,2001,24(1): 69-77.
[53] 韩京宇,徐立臻,董逸生. 一种大数据量的相似记录检测方法
[J]. 计算机研究与发展 ,2005,42(12): 2206-2212.
[54] Hernandez M A. A Generalization of Band Joins and the Merge/Purge Problem . New York: Columbia University, 1996.
[55] 郑仕辉,周傲英,张龙. XML文档的相似测度和结构索引研究
[J]. 计算机学报 ,2003,26(9): 1116-1122.
[56] 陈伟,丁秋林. 一种XML相似重复数据的清理方法研究
[J]. 北京航空航天大学学报 ,2004,30(9): 835-838.
[57] Tai K C. The Tree-to-Tree Correction Problem
[J]. Journal of the ACM, 1979, 26(3): 422-433.
[58] 谌志群. XML文档相似度计算方法研究
[J]. 情报学报 ,2009,28(1): 48-57.
[59] Aratsu T, Hirata K, Kuboyama T. Approximating Tree Edit Distance Through String Edit Distance for Binary Tree Codes
[J]. Lecture Notes in Computer Science, 2009, 54(4): 93-104.
[60] Akutsu T, Fukagawa D, Takasu A. Approximating Tree Edit Distance through String Edit Distance for Binary Tree Codes
[J]. Lecture Notes in Computer Science, 2009, 54(4): 93-104.
[61] 龚安,刘华山.基于编辑距离的XML文档结构聚类的改进算法
[J]. 微计算机应用 ,2008,29(2): 88-91.
[62] Lwin T, Nyunt T T S. An Efficient Duplicate Detection System for XML Documents . In: Proceedings of the 2nd International Conference on Computer Engineering and Applications, Bali Island, Indonesia. 2010: 178-182.
[63] Kimball R. Dealing with Dirty Data
[J]. DBMS, 1996, 9(10): 55-60.
[64] 王金铨,梁茂成,俞洪亮. 基于N-Gram和向量空间模型的语句相似度研究
[J]. 现代外语 ,2007,30(4): 405-413.
[65] Gravano L, Ipeirotis PG, Koudas N, et al. Text Joins in an RDBMS for Web Data Integration . In: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary. 2003: 90-101.
[66] Hernandez M A, Stolfo S J. Real-World Data is Dirty: Data Cleansing and the Merge/Purge Problem
[J]. Data Mining and Knowledge Discovery, 1998, 2(1): 9-37.
[67] Shahri H H, Barforush A A Z. Data Mining for Removing Fuzzy Duplicates using Fuzzy Inference . In: Proceedings of the IEEE Internal Conference on the North American Fuzzy Information Processing Society, Banff, Alberta, Canada. 2004: 419-424.
[68] Bilenko M, Mooney R J. Adaptive Duplicate Detection using Learnable String Similarity Measures . In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington DC, USA, 2003: 39-48.
[69] Ananthakrishna R, Chaudhuri S, Ganti V. Eliminating Fuzzy Duplicates in Data Warehouses . In: Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong, China. 2002: 586-597.
[70] Chaudhuri S, Ganti V, Motwani R. Robust Identification of Fuzzy Duplicates . In: Proceedings of the 21st International Conference on Data Engineering, Tokyo, Japan. 2005:865-876.
[71] 钟嘉庆,张义芳,卢志刚. 数据仓库中重复记录清理算法研究
[J]. 微型机与应用 ,2009,28(7): 4-6.
[72] Hernandez M, Stolfo S. The Merge/Purge Problem for Large Databases . In: Proceedings of the ACM SIGMOD International Conference on Management of Data, San Jose, California. 1995: 127-138.
[73] 李坚,郑宁. 对基于MPN数据清洗算法的改进
[J]. 计算机应用与软件 ,2008,25(2): 245-247.
[74] Monge A, Elkan C. An Efficient Domain Independent Algorithm for Detecting Approximately Duplicate Database Records . In: Proceedings of the SIGMOD Workshop on Data Mining and Knowledge Discovery, Tucson, Arizona. 1997: 23-29.
[75] 朱恒民,王宁生. 一种改进的相似重复记录检测方法
[J]. 控制与决策 ,2006,21(7): 805-808,813.
[76] 佘春红. 基于优先队列的增量式重复记录识别
[J]. 计算机应用 ,2003,23(9): 61-63.
[77] Kan Q, Yang Y J, Liu W H, et al. An Integrated Approach for Detecting Approximate Duplicate Records . In: Proceedings of the 2nd Asia-Pacific Conference on Computational Intelligence and Industrial Applications, Wuhan, China. 2009: 381-384.
[78] Kan Q, Yang Y J, Zhen S Q, et al. A Unified Record Linkage Strategy for Web Service Data . In: Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, Phuket, Thailand. 2010: 253-256.
[79] Broder A Z, Glassman S C, Manasse M S, et al. Syntactic Clustering of the Web
[J]. Computer Networks and ISDN Systems, 1997, 29(8):1157-1165.
[80] Li C, Lu J H, Lu Y M. Efficient Merging and Filtering Algorithms for Approximate String Searches . In: Proceedings of the IEEE International Conference on Data Engineering, Cancun, Mexico. 2008: 257-266.
[81] Cheng T Y, Wang S. A Novel Approach to Clustering Merchandise Records
[J]. Journal of Computer Science and Technology, 2007, 22(2): 228-231.
[82] 陈伟,Qiu R. 审计软件现状及发展趋势研究
[J]. 计算机科学 ,2009,36(2): 1-4, 25.
|