New Technology of Library and Information Service  2011, Vol. 27 Issue (7/8): 82-90    DOI: 10.11925/infotech.1003-3513.2011.07-08.14
Approximately Duplicate Data Cleaning Algorithm Based on Improved Edit Distance
Ye Huanzhuo, Wu Di
School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan 430073, China
Abstract  Similarity calculation is a key issue in approximately duplicate data cleaning, and the edit distance algorithm is widely used for this purpose. Starting from the traditional edit distance algorithm and analyzing factors that affect the similarity results, such as sequence length and synonyms, an improved approximately duplicate data cleaning algorithm based on semantic edit distance is proposed. The algorithm uses a synonym thesaurus and a normalized distance metric, and it can be applied to the identification of similar records. Experimental results show that the similarities computed by the improved algorithm are more consistent with sentence semantics and with people's cognitive experience. The method thereby effectively improves the accuracy and precision of approximately duplicate data detection.
Key words  Approximately duplicate data; Edit distance; Semantic; Synonyms thesaurus
Received: 28 April 2011      Published: 09 October 2011

G202 TP391.1
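
The abstract names two ingredients, a synonym thesaurus and a normalized distance metric, without giving formulas. The following is a minimal sketch of how such a semantic edit distance could look, not the authors' implementation. It assumes word-level tokens, a substitution cost of 0 between words in the same synonym group, and normalization by the longer sequence length; the function names and the toy thesaurus are hypothetical.

```python
from typing import Mapping, Sequence


def same_meaning(a: str, b: str, thesaurus: Mapping[str, int]) -> bool:
    """True if the words are identical or share a synonym-group id.

    `thesaurus` is a hypothetical mapping from word to group id.
    """
    if a == b:
        return True
    group_a = thesaurus.get(a)
    return group_a is not None and group_a == thesaurus.get(b)


def semantic_edit_distance(
    s: Sequence[str], t: Sequence[str], thesaurus: Mapping[str, int]
) -> int:
    """Word-level Levenshtein distance; synonym substitutions cost 0 (assumption)."""
    prev = list(range(len(t) + 1))  # distance from empty prefix of s
    for i in range(1, len(s) + 1):
        curr = [i] + [0] * len(t)
        for j in range(1, len(t) + 1):
            cost = 0 if same_meaning(s[i - 1], t[j - 1], thesaurus) else 1
            curr[j] = min(
                prev[j] + 1,         # delete s[i-1]
                curr[j - 1] + 1,     # insert t[j-1]
                prev[j - 1] + cost,  # substitute or match
            )
        prev = curr
    return prev[len(t)]


def similarity(
    s: Sequence[str], t: Sequence[str], thesaurus: Mapping[str, int]
) -> float:
    """Similarity in [0, 1], normalized by the longer sequence (assumption)."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - semantic_edit_distance(s, t, thesaurus) / longest


# Toy thesaurus for illustration only.
TOY_THESAURUS = {"buy": 0, "purchase": 0, "car": 1, "automobile": 1}
```

With the toy thesaurus, `similarity("buy a cheap car".split(), "purchase a cheap automobile".split(), TOY_THESAURUS)` treats the two records as equivalent, while the same call with an empty thesaurus counts both word pairs as substitutions. Dividing by the longer length keeps scores comparable across records of different sizes, which is one plausible reading of the "sequence length" factor mentioned in the abstract.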


Cite this article:

Ye Huanzhuo, Wu Di. Approximately Duplicate Data Cleaning Algorithm Based on Improved Edit Distance. New Technology of Library and Information Service, 2011, 27(7/8): 82-90.

