I have a question about MySQL. I have a table with 7.479.194 records, some of which are duplicated. I would like to do this:
insert into new_table
select * from old_table
group by city, post_code, short_code
A bit dirty maybe, but it has done the trick for me the few times I've needed it: Remove duplicate entries in MySQL.
Basically, you simply create a unique index consisting of all the columns that you want to be unique in the table.
As always with this kind of procedure, a backup before proceeding is recommended.
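As a sketch (assuming the columns that should be unique are city, post_code and short_code, as in the other answers): on older MySQL versions, ALTER IGNORE silently drops the rows that would violate the new unique index. Note that ALTER IGNORE was removed in MySQL 5.7.4, so on newer versions you would create the index on an empty copy of the table and fill it with INSERT IGNORE instead.

```sql
-- MySQL < 5.7.4 only: duplicate rows are discarded while the index is built
ALTER IGNORE TABLE old_table
  ADD UNIQUE INDEX dedup_idx (city, post_code, short_code);
```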
In my experience, once your table grows to millions of records and more, the most effective way to handle duplicates is to: 1) export the data to text files, 2) sort the files, 3) remove duplicates in the files, 4) load the result back into the database.
As the data grows, this approach eventually becomes faster than any SQL query you may invent.
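The four steps above can be sketched like this (the file paths and the TRUNCATE are illustrative; INTO OUTFILE writes on the database server and requires the FILE privilege):

```sql
-- 1) export data to a text file (server-side)
SELECT * FROM old_table INTO OUTFILE '/tmp/old_table.txt';

-- 2) + 3) sort and de-duplicate outside the database, e.g. at the shell:
--        sort -u /tmp/old_table.txt > /tmp/old_table.dedup.txt

-- 4) load the de-duplicated file back
TRUNCATE TABLE old_table;
LOAD DATA INFILE '/tmp/old_table.dedup.txt' INTO TABLE old_table;
```

The external sort is what makes this scale: `sort` works in bounded memory with temporary files, whereas a big GROUP BY or self-join may not.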
To avoid the memory issue, avoid the big select by using a small external program with the logic below. First, back up your database. Then:
do {
    # find a record
    x = sql: SELECT * FROM table1 LIMIT 1;
    if x is null then
        exit  # no more data in table1
    end if
    # copy the record to the de-duplicated table
    sql: INSERT INTO table2 VALUES (x);
    # find the value of the field that should NOT be duplicated
    a = parse(x for table1.a)
    # delete all such entries from table1
    sql: DELETE FROM table1 WHERE a = '$a';
}
You don't need to group data. Try this:
DELETE FROM old_table
USING old_table, old_table AS vtable
WHERE old_table.id > vtable.id
  AND old_table.city = vtable.city
  AND old_table.post_code = vtable.post_code
  AND old_table.short_code = vtable.short_code;
I can't comment on posts because of my points ... First run repair table old_table; next, show:
EXPLAIN SELECT old_table.id FROM old_table, old_table as vtable
WHERE (old_table.id > vtable.id)
AND (old_table.city=vtable.city AND
old_table.post_code=vtable.post_code
AND old_table.short_code=vtable.short_code);
Show: os~> ulimit -a; mysql> SHOW VARIABLES LIKE 'open_files_limit';
Next: remove any OS restrictions from the mysql process, e.g. raise the open-files limit with
ulimit -n 1024 etc.
MySQL has an INSERT IGNORE. From the docs:
[...] however, when INSERT IGNORE is used, the insert operation fails silently for the row containing the unmatched value, but any rows that are matched are inserted.
So you could use your query from above by just adding IGNORE:
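For instance (assuming new_table already carries a unique index over the columns that must be unique, so duplicate rows are silently skipped rather than aborting the insert):

```sql
INSERT IGNORE INTO new_table
SELECT * FROM old_table;
```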
This will populate NEW_TABLE with unique values, where the id value is the first id of each bunch:
INSERT INTO NEW_TABLE
SELECT MIN(ot.id),
ot.city,
ot.post_code,
ot.short_code
FROM OLD_TABLE ot
GROUP BY ot.city, ot.post_code, ot.short_code
If you want the highest id value per bunch:
INSERT INTO NEW_TABLE
SELECT MAX(ot.id),
ot.city,
ot.post_code,
ot.short_code
FROM OLD_TABLE ot
GROUP BY ot.city, ot.post_code, ot.short_code