An approach to duplicate record detection using similarity metrics and ANFIS
Data quality problems grow with the constantly increasing quantity of data stored in real-world databases, making data cleaning a vital process. The fundamental task in data cleaning is duplicate record detection: identifying record pairs that refer to the same real-world entity (duplicate records). In this paper, we develop a domain-independent approach to detecting duplicate records in large databases. The approach combines ANFIS (Adaptive Neuro-Fuzzy Inference System) with similarity functions to improve duplicate detection. In the training phase, record-level similarity is captured as a feature vector, which is fed to ANFIS as training input. The main aim of using ANFIS is to reduce the time taken to decide whether a record pair is a duplicate. To minimize the number of record comparisons, K-means clustering is applied in the duplicate detection phase. Experiments on real-life datasets, evaluated with standard metrics, show that the proposed approach detects duplicates efficiently and accurately.
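The two building blocks around the ANFIS classifier can be sketched in a few lines. The following is a minimal, hypothetical illustration only: it uses difflib's `SequenceMatcher` ratio as a stand-in for the paper's similarity metrics, and a plain K-means implementation for the blocking step; the actual similarity functions, the number of clusters, and the ANFIS training itself (for which no standard-library implementation exists) are not specified here.

```python
import random
from difflib import SequenceMatcher


def feature_vector(rec_a, rec_b):
    """Record-level similarity: one score in [0, 1] per aligned field.

    The resulting vector is the kind of input the paper feeds to ANFIS.
    """
    return [SequenceMatcher(None, a, b).ratio() for a, b in zip(rec_a, rec_b)]


def kmeans(points, k, iters=20, seed=0):
    """Plain K-means on numeric vectors; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Field-wise similarities for one candidate pair (hypothetical records):
vec = feature_vector(("john smith", "123 main st"), ("jon smith", "123 main st."))

# Blocking: cluster numerically embedded records and compare only records
# that fall in the same cluster, reducing the number of pairwise comparisons
# from quadratic in the dataset size to quadratic per cluster.
labels = kmeans([[0.0, 0.0], [0.1, 0.2], [9.0, 9.1], [9.2, 8.8]], k=2)
```

In a full pipeline, only pairs of records sharing a K-means cluster would be turned into feature vectors and passed to the trained ANFIS model for the duplicate/non-duplicate decision.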