Абстрактный

An Iterative Approach to Record Deduplication

M. Roshini Karunya, S. Lalitha, B.Tech., M.E.,

Record deduplication is the task of identifying, in a data repository, records that refer to the same real world entity or object in spite of misspelling words, typos, different writing styles or even different schema representations or data types [1]. The existing system aims at providing Unsupervised Duplication Detection method which can be used to identify and remove the duplicate records from different data sources. UDD, which for a given query, can effectively identify duplicates from the query result records of multiple web databases. Two cooperating classifiers, a Weighted Component Similarity Summing Classifier (WCSS) and Support Vector Machine (SVM) are used to iteratively identify the duplicate records from the non duplicate record and we also present a Genetic Programming (GP) approach to identify record deduplication. Since record deduplication is a time consuming task even for small repositories, our aim is to foster a method that finds a proper combination of the best pieces of evidence, thus yielding a deduplication function that maximizes performance using a small representative portion of the corresponding data for training purposes. We propose two more algorithms namely Particle Swarm Optimization (PSO), Bat Algorithm (BA) to improve the optimization. Index Terms – Data mining, duplicate records, genetic algorithm

Отказ от ответственности: Этот реферат был переведен с помощью инструментов искусственного интеллекта и еще не прошел проверку или верификацию

Индексировано в

Индекс Коперника
Академические ключи
CiteFactor
Космос ЕСЛИ
РефСик
Университет Хамдарда
Всемирный каталог научных журналов
Импакт-фактор Международного инновационного журнала (IIJIF)
Международный институт организованных исследований (I2OR)
Cosmos

Посмотреть больше