New Technology of Library and Information Service  2011, Vol. 27 Issue (7/8): 82-90    DOI: 10.11925/infotech.1003-3513.2011.07-08.14
Approximately Duplicate Data Cleaning Algorithm Based on Improved Edit Distance
Ye Huanzhuo, Wu Di
School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan 430073, China
Abstract  Similarity calculation is a key issue in approximately duplicate data cleaning, and the edit distance algorithm is widely used for it. Starting from the traditional edit distance algorithm and analyzing factors that affect the similarity result, such as sequence length and synonyms, this paper proposes an improved approximately duplicate data cleaning algorithm based on semantic edit distance. The algorithm uses a synonym thesaurus and a normalized distance metric, and can be applied to the identification of similar records. Experimental results show that the similarity values computed by the improved algorithm are more consistent with sentence semantics and human cognitive experience. The method thus effectively improves the accuracy and precision of approximately duplicate data detection.
Key words: Approximately duplicate data; Edit distance; Semantic; Synonyms thesaurus
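The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the cost scheme or the normalization formula, so the sketch assumes a word-level edit distance in which substituting a word by a synonym costs 0, and it assumes the normalization 2d / (|x| + |y| + d), which is one commonly used normalized form of Levenshtein distance. The thesaurus `SYNONYM_GROUPS` and all function names are hypothetical and exist only for illustration.

```python
from typing import Dict, Sequence

# Hypothetical toy thesaurus: words sharing a group id are treated as synonyms.
SYNONYM_GROUPS: Dict[str, int] = {
    "buy": 1, "purchase": 1,
    "car": 2, "automobile": 2,
}


def are_synonyms(a: str, b: str) -> bool:
    """Return True if the two words are identical or share a synonym group."""
    if a == b:
        return True
    group = SYNONYM_GROUPS.get(a)
    return group is not None and group == SYNONYM_GROUPS.get(b)


def semantic_edit_distance(x: Sequence[str], y: Sequence[str]) -> int:
    """Word-level edit distance; synonym substitution is assumed to cost 0."""
    prev = list(range(len(y) + 1))  # distance from empty prefix of x
    for i, xi in enumerate(x, start=1):
        curr = [i]
        for j, yj in enumerate(y, start=1):
            sub = 0 if are_synonyms(xi, yj) else 1
            curr.append(min(prev[j] + 1,         # delete xi
                            curr[j - 1] + 1,     # insert yj
                            prev[j - 1] + sub))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(x: Sequence[str], y: Sequence[str]) -> float:
    """Assumed normalization 2d / (|x| + |y| + d), which lies in [0, 1]."""
    d = semantic_edit_distance(x, y)
    total = len(x) + len(y)
    if total == 0:
        return 0.0
    return 2 * d / (total + d)


def similarity(x: Sequence[str], y: Sequence[str]) -> float:
    """Similarity score derived from the normalized distance."""
    return 1.0 - normalized_distance(x, y)
```

Under these assumptions, the word sequences "i buy a car" and "i purchase an automobile" differ by a single edit ("a" versus "an"), whereas a plain word-level edit distance would count three. Normalizing by sequence length keeps one edit in a short record from being scored the same as one edit in a long record.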
Received: 28 April 2011      Published: 09 October 2011
CLC Number: G202; TP391.1

 

Cite this article:

Ye Huanzhuo, Wu Di. Approximately Duplicate Data Cleaning Algorithm Based on Improved Edit Distance. New Technology of Library and Information Service, 2011, 27(7/8): 82-90.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2011.07-08.14     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2011/V27/I7/8/82
