Bail out data duplicates with SQL

Bail out data duplicates with SQL

A good relational database model should avoid data duplicates. But sometimes you get working on a migration project. Maybe a project with a bad data basis.

The migration process seems to work, it looks good. Then it happens: the migration script aborts with a constraint error. You dig into the problem and discover that the suspicious table actually contains duplicate data, without a unique ID to distinguish the records, of course.

Here are two examples of how you can solve the problem in PostgreSQL or Oracle:

PostgreSQL solution

Firstly the PostgreSQL way. We create data with a duplicate.

playground=# create table mydata(name text);
CREATE TABLE
playground=# insert into mydata values ('Atari');
INSERT 0 1
playground=# insert into mydata values ('Commodore');
INSERT 0 1
playground=# insert into mydata values ('Sinclair ZX Spectrum');
INSERT 0 1
playground=# select * from mydata;
         name         
----------------------
 Atari
 Commodore
 Sinclair ZX Spectrum
(3 rows)

playground=# insert into mydata values ('Atari');
INSERT 0 1
playground=# select * from mydata;
         name         
----------------------
 Atari
 Commodore
 Sinclair ZX Spectrum
 Atari
(4 rows)

playground=# select name, count(*) from mydata group by name;
         name         | count 
----------------------+-------
 Commodore            |     1
 Sinclair ZX Spectrum |     1
 Atari                |     2
(3 rows)

And now remove the duplicates with this command:

playground=# delete from mydata a using  mydata b where a=b and a.ctid <  b.ctid;
DELETE 1
playground=# select * from mydata;
         name         
----------------------
 Commodore
 Sinclair ZX Spectrum
 Atari
(3 rows)

playground=# select name, count(*) from mydata group by name;
         name         | count 
----------------------+-------
 Commodore            |     1
 Sinclair ZX Spectrum |     1
 Atari                |     1
(3 rows)

Because both duplicate records were equal it doesn't matter which is really deleted internally.

Oracle solution

Actually the same logic but slightly different:

delete from mydata where rowid in (select a.rowid from mydata a, mydata b where a.name=b.name and a.rowid < b.rowid);

Same as mentioned above, it's irrelevant which duplicate is removed internally.