How to delete duplicates on a MySQL table?

遇见更好的自我 2020-11-22 01:35

I need to DELETE the duplicated rows for a specified sid in a MySQL table.

How can I do this with an SQL query?

         


        
25 answers
  • 2020-11-22 02:09

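    This makes column_name the primary key and, because of IGNORE, skips duplicate-key errors instead of aborting. As a result, rows with a duplicate value for column_name are deleted and only the first row for each value is kept. Note that ALTER IGNORE TABLE was removed in MySQL 5.7.4, so this only works on older servers.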

    ALTER IGNORE TABLE `table_name` ADD PRIMARY KEY (`column_name`);
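
    On servers that reject ALTER IGNORE TABLE, a similar effect can be sketched by copying the rows into a keyed table with INSERT IGNORE. This is a swapped-in technique, not the statement above. The names table_name_dedup and table_name_old are hypothetical, and the sketch assumes table_name has no primary key yet and column_name holds no NULLs.

    ```sql
    -- Empty copy of the table with the same column definitions.
    CREATE TABLE `table_name_dedup` LIKE `table_name`;

    -- Add the key while the copy is still empty, so nothing can conflict yet.
    ALTER TABLE `table_name_dedup` ADD PRIMARY KEY (`column_name`);

    -- Rows that would violate the key are skipped with a warning, not an error.
    INSERT IGNORE INTO `table_name_dedup` SELECT * FROM `table_name`;

    -- Swap the tables in one statement; the original is kept as a fallback.
    RENAME TABLE `table_name` TO `table_name_old`,
                 `table_name_dedup` TO `table_name`;
    ```

    Which of the duplicate rows survives depends on the order in which the SELECT returns them, so add an ORDER BY if that matters.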
    
  • 2020-11-22 02:10

    Could it work to count the duplicates and then add a LIMIT to your DELETE query so that just one row is left?

    For example, if sid = 1 has exactly two rows, this deletes one of them (for n rows, use LIMIT n - 1):

    DELETE FROM `table_name` WHERE sid = 1 LIMIT 1;
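
    LIMIT does not accept an expression such as COUNT(*) - 1 directly, so one way to sketch the count-then-delete idea is a prepared statement with placeholders. The table name and the user variables @n, @excess and @sid are hypothetical; treat this as an illustration rather than a tested recipe.

    ```sql
    -- How many rows currently share this sid?
    SELECT COUNT(*) INTO @n FROM `table_name` WHERE sid = 1;

    -- Delete all but one of them; never go below zero.
    SET @excess = GREATEST(@n - 1, 0);
    SET @sid = 1;

    -- Placeholders are allowed in LIMIT inside a prepared statement.
    PREPARE del FROM 'DELETE FROM `table_name` WHERE sid = ? LIMIT ?';
    EXECUTE del USING @sid, @excess;
    DEALLOCATE PREPARE del;
    ```

    Without an ORDER BY, MySQL is free to pick which of the matching rows it deletes, which is only acceptable when the rows are true duplicates.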
    
  • 2020-11-22 02:11
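    -- Keep the row with the highest id for each url and delete the rest.
    delete p from
    product p
    inner join (
        -- one row per duplicated url, carrying the id to keep
        select max(id) as id, url from product
        group by url
        having count(*) > 1
    ) unik on unik.url = p.url and unik.id != p.id;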
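
    Before running a DELETE like this, it can help to preview the rows it would remove. A minimal sketch that reuses the same join as a SELECT, assuming the same product table with id and url columns:

    ```sql
    -- Same join as the DELETE, but read-only: lists the rows that would go.
    SELECT p.*
    FROM product p
    INNER JOIN (
        SELECT MAX(id) AS id, url
        FROM product
        GROUP BY url
        HAVING COUNT(*) > 1
    ) unik ON unik.url = p.url AND unik.id <> p.id
    ORDER BY p.url, p.id;
    ```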
    
  • 2020-11-22 02:13

    A walkthrough for deleting duplicate rows in MySQL in place, assuming you have a timestamp column to sort by:

    Create the table and insert some rows:

    create table penguins(foo int, bar varchar(15), baz datetime);
    insert into penguins values(1, 'skipper', now());
    insert into penguins values(1, 'skipper', now());
    insert into penguins values(3, 'kowalski', now());
    insert into penguins values(3, 'kowalski', now());
    insert into penguins values(3, 'kowalski', now());
    insert into penguins values(4, 'rico', now());
    select * from penguins;
        +------+----------+---------------------+
        | foo  | bar      | baz                 |
        +------+----------+---------------------+
        |    1 | skipper  | 2014-08-25 14:21:54 |
        |    1 | skipper  | 2014-08-25 14:21:59 |
        |    3 | kowalski | 2014-08-25 14:22:09 |
        |    3 | kowalski | 2014-08-25 14:22:13 |
        |    3 | kowalski | 2014-08-25 14:22:15 |
        |    4 | rico     | 2014-08-25 14:22:22 |
        +------+----------+---------------------+
    6 rows in set (0.00 sec)
    

    Remove the duplicates in place:

    delete a
        from penguins a
        left join (
            select max(baz) as maxtimestamp, foo, bar
            from penguins
            group by foo, bar
        ) b
        on a.baz = b.maxtimestamp and
        a.foo = b.foo and
        a.bar = b.bar
        where b.maxtimestamp is null;
    Query OK, 3 rows affected (0.01 sec)
    select * from penguins;
    +------+----------+---------------------+
    | foo  | bar      | baz                 |
    +------+----------+---------------------+
    |    1 | skipper  | 2014-08-25 14:21:59 |
    |    3 | kowalski | 2014-08-25 14:22:15 |
    |    4 | rico     | 2014-08-25 14:22:22 |
    +------+----------+---------------------+
    3 rows in set (0.00 sec)
    

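    You're done: the duplicate rows are removed and the latest one by timestamp is kept. Note that if two duplicates share the same maximum timestamp, both survive, because both match the join.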

    For those of you without a timestamp or unique column.

    You don't have a timestamp or a unique index column to sort by? You're living in a state of degeneracy. You'll have to take additional steps to delete duplicate rows.

    Create the penguins table and add some rows:

    create table penguins(foo int, bar varchar(15)); 
    insert into penguins values(1, 'skipper'); 
    insert into penguins values(1, 'skipper'); 
    insert into penguins values(3, 'kowalski'); 
    insert into penguins values(3, 'kowalski'); 
    insert into penguins values(3, 'kowalski'); 
    insert into penguins values(4, 'rico'); 
    select * from penguins; 
        # +------+----------+ 
        # | foo  | bar      | 
        # +------+----------+ 
        # |    1 | skipper  | 
        # |    1 | skipper  | 
        # |    3 | kowalski | 
        # |    3 | kowalski | 
        # |    3 | kowalski | 
        # |    4 | rico     | 
        # +------+----------+ 
    

    Make a clone of the first table and copy the rows into it:

    drop table if exists penguins_copy; 
    create table penguins_copy as ( SELECT foo, bar FROM penguins );  
    
    #add an autoincrementing primary key: 
    ALTER TABLE penguins_copy ADD moo int AUTO_INCREMENT PRIMARY KEY first; 
    
    select * from penguins_copy; 
        # +-----+------+----------+ 
        # | moo | foo  | bar      | 
        # +-----+------+----------+ 
        # |   1 |    1 | skipper  | 
        # |   2 |    1 | skipper  | 
        # |   3 |    3 | kowalski | 
        # |   4 |    3 | kowalski | 
        # |   5 |    3 | kowalski | 
        # |   6 |    4 | rico     | 
        # +-----+------+----------+ 
    

    The max aggregate operates upon the new moo index:

    delete a from penguins_copy a left join( 
        select max(moo) myindex, foo, bar 
        from penguins_copy 
        group by foo, bar) b 
        on a.moo = b.myindex and 
        a.foo = b.foo and 
        a.bar = b.bar 
        where b.myindex IS NULL; 
    
    #drop the extra column on the copied table 
    alter table penguins_copy drop moo; 
    select * from penguins_copy; 
    
    #drop the first table and put the copy table back: 
    drop table penguins; 
    create table penguins select * from penguins_copy; 
    

    Observe and clean up:

    drop table penguins_copy; 
    select * from penguins;
    +------+----------+ 
    | foo  | bar      | 
    +------+----------+ 
    |    1 | skipper  | 
    |    3 | kowalski | 
    |    4 | rico     | 
    +------+----------+ 
        Elapsed: 1458.359 milliseconds 
    

    What's that big SQL delete statement doing?

    Table penguins, aliased as 'a', is left joined to a subset of itself, aliased as 'b'. The subquery 'b' finds the max timestamp [or max moo] for each group of rows that share the same foo and bar. The left-hand table 'a' contributes every row (foo, bar, baz); the right-hand subset 'b' contributes (maxtimestamp, foo, bar), which matches a row of 'a' only when that row IS the max of its group.

    Every row that is not the max therefore gets NULL for maxtimestamp. Filter on those NULLs and you have every row that isn't the latest timestamp baz within its (foo, bar) group. Delete those.
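
    To see that explanation in action, the same join can be run as a plain SELECT. This is only an illustrative sketch against the penguins table with the baz timestamp column; the will_be_deleted alias is made up for the example.

    ```sql
    -- Rows where b.maxtimestamp is NULL found no "latest" partner:
    -- these are exactly the rows the DELETE removes.
    SELECT a.foo,
           a.bar,
           a.baz,
           b.maxtimestamp,
           b.maxtimestamp IS NULL AS will_be_deleted
    FROM penguins a
    LEFT JOIN (
        SELECT MAX(baz) AS maxtimestamp, foo, bar
        FROM penguins
        GROUP BY foo, bar
    ) b ON a.baz = b.maxtimestamp
       AND a.foo = b.foo
       AND a.bar = b.bar
    ORDER BY a.foo, a.baz;
    ```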

    Make a backup of the table before you run this.
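
    One simple way to take that backup inside MySQL itself is a side-by-side copy of the table. The name penguins_backup is hypothetical; a dump taken with your usual backup tooling works just as well.

    ```sql
    -- Copy the structure (columns and indexes), then the data.
    CREATE TABLE penguins_backup LIKE penguins;
    INSERT INTO penguins_backup SELECT * FROM penguins;

    -- Sanity check: both counts should match before you delete anything.
    SELECT (SELECT COUNT(*) FROM penguins)        AS original_rows,
           (SELECT COUNT(*) FROM penguins_backup) AS backup_rows;
    ```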

    Prevent this problem from ever happening again on this table:

    If you got this to work and it put out your "duplicate row" fire, great. Now define a new composite unique key on your table (on those two columns) to prevent more duplicates from being added in the first place.

    Like a good immune system, the table shouldn't even let the bad rows in at the time of insert. Later on, all those programs adding duplicates will broadcast their protest, and when you fix them, this issue never comes up again.
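
    A minimal sketch of such a composite unique key on the penguins table, assuming the duplicates on (foo, bar) have already been removed (otherwise the ALTER itself fails). The key name uq_penguins_foo_bar is made up.

    ```sql
    -- Reject any future row whose (foo, bar) pair already exists.
    ALTER TABLE penguins ADD UNIQUE KEY uq_penguins_foo_bar (foo, bar);

    -- A plain INSERT of an existing pair would now fail with a
    -- duplicate-entry error (1062). INSERT IGNORE skips the row with a
    -- warning instead, which suits bulk loads that may contain repeats.
    INSERT IGNORE INTO penguins (foo, bar) VALUES (4, 'rico');
    ```

    Keep in mind that a unique key still allows multiple rows in which one of the key columns is NULL.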

  • 2020-11-22 02:13

    After running into this issue myself, on a huge database, I wasn't completely impressed with the performance of any of the other answers. I want to keep only the latest duplicate row, and delete the rest.

    As a single statement, without a temp table, this worked best for me:

    DELETE e.*
    FROM employee e
    WHERE id IN
     (SELECT id
       FROM (SELECT MIN(id) as id
              FROM employee e2
              GROUP BY first_name, last_name
              HAVING COUNT(*) > 1) x);
    

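    The only caveat is that each run removes just one row (the one with the lowest id) from every duplicate group, so I have to run the query repeatedly until it affects no rows. Even so, I found it worked better for me than the other options.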
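
    If the repeated runs are a nuisance, a join-based variant can remove every older duplicate in one pass. This is a different statement from the one above, shown only as a sketch; it assumes the same employee table and that a higher id means a later row. The alias latest and the column keep_id are hypothetical.

    ```sql
    -- For each duplicated (first_name, last_name), keep the highest id
    -- and delete every row with a lower id in the same group.
    DELETE e
    FROM employee e
    INNER JOIN (
        SELECT MAX(id) AS keep_id, first_name, last_name
        FROM employee
        GROUP BY first_name, last_name
        HAVING COUNT(*) > 1
    ) latest
       ON latest.first_name = e.first_name
      AND latest.last_name  = e.last_name
      AND e.id < latest.keep_id;
    ```

    Rows whose first_name or last_name is NULL are left alone, because the = comparisons in the join never match NULL.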

  • 2020-11-22 02:13

    You could just use a DISTINCT clause to select the "cleaned up" list (and here is a very easy example on how to do that).
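
    As a rough illustration of the DISTINCT idea, here is a sketch that borrows the two-column penguins table from the walkthrough above; the table name penguins_dedup is hypothetical.

    ```sql
    -- Read-only: one row per distinct (foo, bar) combination.
    SELECT DISTINCT foo, bar FROM penguins;

    -- Optionally materialize the cleaned-up list as a new table.
    CREATE TABLE penguins_dedup AS
    SELECT DISTINCT foo, bar FROM penguins;
    ```

    DISTINCT only collapses rows that are identical in every selected column, and the original table is left untouched, so this selects a clean list rather than deleting anything.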
