How to delete duplicates on a MySQL table?

I need to delete the duplicate rows for a specified sid in a MySQL table.

How can I do this with an SQL query?

25 answers

长发绾君心, 2020-11-22 02:09

This makes column_name the primary key while ignoring the duplicate-key errors raised along the way, so the rows with a duplicate value for column_name are deleted. ALTER IGNORE TABLE was removed in MySQL 5.7.4, so this only works on older servers.

ALTER IGNORE TABLE `table_name` ADD PRIMARY KEY (`column_name`);
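
Where ALTER IGNORE is not available, a similar effect can be approximated by copying the rows into a new table that already has the primary key and skipping the rows that collide. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server, the payload column and the table_name_dedup table are made up, and SQLite spells the statement INSERT OR IGNORE where MySQL uses INSERT IGNORE.

```python
import sqlite3

# SQLite stands in for MySQL here; the table layout is an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_name (column_name INTEGER, payload TEXT);
    INSERT INTO table_name VALUES (1, 'a'), (1, 'b'), (2, 'c');

    -- New table with the key in place from the start.
    CREATE TABLE table_name_dedup (column_name INTEGER PRIMARY KEY, payload TEXT);

    -- Rows whose key already exists are skipped instead of raising an error.
    -- (MySQL spelling: INSERT IGNORE INTO ...)
    INSERT OR IGNORE INTO table_name_dedup
        SELECT column_name, payload FROM table_name ORDER BY rowid;
""")

rows = conn.execute(
    "SELECT column_name, payload FROM table_name_dedup ORDER BY column_name"
).fetchall()
print(rows)
```

The first row seen for each key wins, so the ORDER BY in the copy decides which duplicate survives.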

生来不讨喜, 2020-11-22 02:10

You could count the duplicates and then add a LIMIT to your DELETE query so that just one row is left. With n rows for a given sid, the limit is n - 1.

For example, if you have two rows, write your query like this:

DELETE FROM table_name WHERE sid = 1 LIMIT 1;
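
The count-then-limit idea can be sketched as follows. This is a hedged illustration rather than a drop-in MySQL script: it runs on Python's built-in sqlite3 module, the items table and its label column are invented, and because many SQLite builds lack DELETE ... LIMIT the limit is applied to a rowid subquery instead (in MySQL the LIMIT goes directly on the DELETE, as above).

```python
import sqlite3

# SQLite stands in for MySQL; "items" and "label" are hypothetical names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (sid INTEGER, label TEXT);
    INSERT INTO items VALUES (1, 'x'), (1, 'x'), (1, 'x'), (2, 'y');
""")

sid = 1
# Step 1: count how many rows share this sid.
(copies,) = conn.execute(
    "SELECT COUNT(*) FROM items WHERE sid = ?", (sid,)
).fetchone()

# Step 2: delete all but one of them (limit = count - 1).
if copies > 1:
    conn.execute(
        "DELETE FROM items WHERE rowid IN "
        "(SELECT rowid FROM items WHERE sid = ? LIMIT ?)",
        (sid, copies - 1),
    )

left = conn.execute("SELECT sid, label FROM items ORDER BY sid").fetchall()
print(left)
```

Without an ORDER BY, which of the identical rows survives is arbitrary, which only matters if the rows differ in other columns.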

轻奢々, 2020-11-22 02:11

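-- For every url that occurs more than once, keep the row with the highest id
-- and delete the other rows that share that url.
delete p from 
product p
inner join (
    select max(id) as id, url from product 
    group by url 
    having count(*) > 1
) unik on unik.url = p.url and unik.id != p.id;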

无人共我, 2020-11-22 02:13

A walkthrough for deleting duplicate rows in MySQL in place, assuming you have a timestamp column to sort by:

Create the table and insert some rows:

create table penguins(foo int, bar varchar(15), baz datetime);
insert into penguins values(1, 'skipper', now());
insert into penguins values(1, 'skipper', now());
insert into penguins values(3, 'kowalski', now());
insert into penguins values(3, 'kowalski', now());
insert into penguins values(3, 'kowalski', now());
insert into penguins values(4, 'rico', now());
select * from penguins;
    +------+----------+---------------------+
    | foo  | bar      | baz                 |
    +------+----------+---------------------+
    |    1 | skipper  | 2014-08-25 14:21:54 |
    |    1 | skipper  | 2014-08-25 14:21:59 |
    |    3 | kowalski | 2014-08-25 14:22:09 |
    |    3 | kowalski | 2014-08-25 14:22:13 |
    |    3 | kowalski | 2014-08-25 14:22:15 |
    |    4 | rico     | 2014-08-25 14:22:22 |
    +------+----------+---------------------+
6 rows in set (0.00 sec)


Remove the duplicates in place:

delete a
    from penguins a
    left join(
    select max(baz) maxtimestamp, foo, bar
    from penguins
    group by foo, bar) b
    on a.baz = maxtimestamp and
    a.foo = b.foo and
    a.bar = b.bar
    where b.maxtimestamp IS NULL;
Query OK, 3 rows affected (0.01 sec)
select * from penguins;
+------+----------+---------------------+
| foo  | bar      | baz                 |
+------+----------+---------------------+
|    1 | skipper  | 2014-08-25 14:21:59 |
|    3 | kowalski | 2014-08-25 14:22:15 |
|    4 | rico     | 2014-08-25 14:22:22 |
+------+----------+---------------------+
3 rows in set (0.00 sec)


You're done: the duplicate rows are removed and the latest one by timestamp is kept. Rows that tie on the maximum timestamp are all kept, so this assumes the timestamps within a group are distinct.

For those of you without a timestamp or unique column.

You don't have a timestamp or a unique index column to sort by?  You're living in a state of degeneracy.  You'll need some additional steps to delete duplicate rows.

Create the penguins table and add some rows:

create table penguins(foo int, bar varchar(15)); 
insert into penguins values(1, 'skipper'); 
insert into penguins values(1, 'skipper'); 
insert into penguins values(3, 'kowalski'); 
insert into penguins values(3, 'kowalski'); 
insert into penguins values(3, 'kowalski'); 
insert into penguins values(4, 'rico'); 
select * from penguins; 
    # +------+----------+ 
    # | foo  | bar      | 
    # +------+----------+ 
    # |    1 | skipper  | 
    # |    1 | skipper  | 
    # |    3 | kowalski | 
    # |    3 | kowalski | 
    # |    3 | kowalski | 
    # |    4 | rico     | 
    # +------+----------+ 


Make a clone of the first table and copy the rows into it:

drop table if exists penguins_copy; 
create table penguins_copy as ( SELECT foo, bar FROM penguins );  

#add an autoincrementing primary key: 
ALTER TABLE penguins_copy ADD moo int AUTO_INCREMENT PRIMARY KEY first; 

select * from penguins_copy; 
    # +-----+------+----------+ 
    # | moo | foo  | bar      | 
    # +-----+------+----------+ 
    # |   1 |    1 | skipper  | 
    # |   2 |    1 | skipper  | 
    # |   3 |    3 | kowalski | 
    # |   4 |    3 | kowalski | 
    # |   5 |    3 | kowalski | 
    # |   6 |    4 | rico     | 
    # +-----+------+----------+ 


The max aggregate operates upon the new moo index: 

delete a from penguins_copy a left join( 
    select max(moo) myindex, foo, bar 
    from penguins_copy 
    group by foo, bar) b 
    on a.moo = b.myindex and 
    a.foo = b.foo and 
    a.bar = b.bar 
    where b.myindex IS NULL; 

#drop the extra column on the copied table 
alter table penguins_copy drop moo; 
select * from penguins_copy; 

#drop the first table and put the copy table back: 
drop table penguins; 
create table penguins select * from penguins_copy; 


Observe and clean up:

drop table penguins_copy; 
select * from penguins;
+------+----------+ 
| foo  | bar      | 
+------+----------+ 
|    1 | skipper  | 
|    3 | kowalski | 
|    4 | rico     | 
+------+----------+ 
    Elapsed: 1458.359 milliseconds 


What's that big SQL delete statement doing?

Table penguins, aliased as 'a', is left joined to a subquery aliased as 'b'.  The subquery 'b' finds the max timestamp (or max moo) for each group of foo and bar.  The left-hand table 'a' holds every row in the table as (foo, bar, baz).  The right-hand side 'b' holds one (maxtimestamp, foo, bar) per group, which matches only the row in 'a' that IS the max.

Every row that is not that max gets a maxtimestamp of NULL.  Filter on those NULL rows and you have every row that is not the latest timestamp baz within its (foo, bar) group.  Delete those.
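
To see the NULLs for yourself, the same left join can be run as a SELECT before deleting anything. The sketch below is an illustration using Python's built-in sqlite3 module in place of MySQL (the join itself is plain SQL); the timestamps are stored as text copied from the sample output above.

```python
import sqlite3

# SQLite stands in for MySQL; timestamps are kept as ISO-style text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE penguins (foo INTEGER, bar TEXT, baz TEXT);
    INSERT INTO penguins VALUES
        (1, 'skipper',  '2014-08-25 14:21:54'),
        (1, 'skipper',  '2014-08-25 14:21:59'),
        (3, 'kowalski', '2014-08-25 14:22:09'),
        (3, 'kowalski', '2014-08-25 14:22:13'),
        (3, 'kowalski', '2014-08-25 14:22:15'),
        (4, 'rico',     '2014-08-25 14:22:22');
""")

# Same join as the DELETE, but only looking at what it would match.
joined = conn.execute("""
    SELECT a.foo, a.bar, a.baz, b.maxtimestamp
    FROM penguins a
    LEFT JOIN (
        SELECT MAX(baz) AS maxtimestamp, foo, bar
        FROM penguins
        GROUP BY foo, bar
    ) b
      ON a.baz = b.maxtimestamp AND a.foo = b.foo AND a.bar = b.bar
    ORDER BY a.baz
""").fetchall()

# Rows with no match on the right side (NULL -> None) are the ones to delete.
doomed = [row[:3] for row in joined if row[3] is None]
for row in joined:
    print(row)
```

Three rows come back with None in the last column, the same three that the DELETE reports as affected.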

Make a backup of the table before you run this.  

Prevent this problem from ever happening again on this table:

If this worked and put out your "duplicate row" fire, great.  Now define a new composite unique key on your table (on those two columns) to prevent more duplicates from being added in the first place.  

Like a good immune system, the bad rows shouldn't even be allowed into the table at the time of insert.  Later on, all those programs adding duplicates will broadcast their protest, and once you fix them, this issue never comes up again.
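
A minimal sketch of that guard, again on Python's built-in sqlite3 module rather than a MySQL server. The index name uniq_foo_bar is made up; the CREATE UNIQUE INDEX statement itself is also accepted by MySQL, where the rejected insert surfaces as a duplicate-entry error instead of sqlite3.IntegrityError.

```python
import sqlite3

# SQLite stands in for MySQL; the index name is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE penguins (foo INTEGER, bar TEXT);
    INSERT INTO penguins VALUES (1, 'skipper'), (3, 'kowalski'), (4, 'rico');

    -- Composite unique key on the two columns that define a duplicate.
    CREATE UNIQUE INDEX uniq_foo_bar ON penguins (foo, bar);
""")

try:
    # This row already exists, so the insert should be refused.
    conn.execute("INSERT INTO penguins VALUES (1, 'skipper')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM penguins").fetchone()[0]
print(rejected, row_count)
```

The index can only be created after the existing duplicates are gone, which is why this step comes last.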

渐次进展, 2020-11-22 02:13

After running into this issue myself on a huge database, I wasn't completely impressed with the performance of any of the other answers.  I wanted to keep only the latest duplicate row and delete the rest.

As a single statement, without a temp table, this worked best for me:

DELETE e.*
FROM employee e
WHERE id IN
 (SELECT id
   FROM (SELECT MIN(id) as id
          FROM employee e2
          GROUP BY first_name, last_name
          HAVING COUNT(*) > 1) x);


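The only caveat is that I have to run the query multiple times: each run deletes only the lowest id in every duplicate group, so a group with more than two copies needs further runs until no rows are affected. Even with that, I found it worked better for me than the other options.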
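
The repeated runs can be driven by a small loop that stops once a pass deletes nothing. The sketch below is an assumption-laden illustration: it runs on Python's built-in sqlite3 module instead of MySQL, the sample names are invented, and the statement is written as DELETE FROM employee without the e.* alias form so that SQLite accepts it.

```python
import sqlite3

# SQLite stands in for MySQL; the sample rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    INSERT INTO employee (first_name, last_name) VALUES
        ('Ada', 'Lovelace'), ('Ada', 'Lovelace'), ('Ada', 'Lovelace'),
        ('Alan', 'Turing'), ('Alan', 'Turing'),
        ('Grace', 'Hopper');
""")

# Removes the oldest (lowest id) row of every group that still has duplicates.
DELETE_OLDEST = """
    DELETE FROM employee
    WHERE id IN (SELECT id
                 FROM (SELECT MIN(id) AS id
                       FROM employee
                       GROUP BY first_name, last_name
                       HAVING COUNT(*) > 1) AS x)
"""

passes = 0
while True:
    deleted = conn.execute(DELETE_OLDEST).rowcount
    if deleted == 0:  # nothing left to remove, every group is unique
        break
    passes += 1

remaining = conn.execute(
    "SELECT id, first_name FROM employee ORDER BY id"
).fetchall()
print(passes, remaining)
```

The number of passes equals the size of the largest duplicate group minus one, so heavily duplicated data means many passes.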

傲寒, 2020-11-22 02:13

You could just use a DISTINCT clause to select the "cleaned up" list (and here is a very easy example on how to do that).
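
One way this might look, as a sketch only: select the distinct rows into a new table and leave the original untouched. It reuses the penguins sample data from the earlier answer, runs on Python's built-in sqlite3 module rather than MySQL, and the penguins_clean table name is made up. Note that DISTINCT only produces a cleaned-up copy; it does not delete anything from the source table.

```python
import sqlite3

# SQLite stands in for MySQL; penguins_clean is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE penguins (foo INTEGER, bar TEXT);
    INSERT INTO penguins VALUES
        (1, 'skipper'), (1, 'skipper'),
        (3, 'kowalski'), (3, 'kowalski'), (3, 'kowalski'),
        (4, 'rico');

    -- Each distinct (foo, bar) combination is copied exactly once.
    CREATE TABLE penguins_clean AS SELECT DISTINCT foo, bar FROM penguins;
""")

cleaned = conn.execute(
    "SELECT foo, bar FROM penguins_clean ORDER BY foo"
).fetchall()
print(cleaned)
```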