I have a question about MySQL. I have a table with 7.479.194 records, some of which are duplicated. I would like to do this:
insert into new_table
select * from old_table
group by city, post_code, short_code
A bit dirty maybe, but it has done the trick for me the few times I've needed it: Remove duplicate entries in MySQL.
Basically, you simply create a unique index consisting of all the columns that you want to be unique in the table.
As always with this kind of procedure, a backup before proceeding is recommended.
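As a sketch (assuming the columns that should be unique are city, post_code and short_code, as in the other answers): on older MySQL versions, ALTER IGNORE silently drops the rows that would violate the new unique index. Note that ALTER IGNORE was removed in MySQL 5.7.4, so on newer versions you would create the index on an empty copy of the table and fill it with INSERT IGNORE instead.

```sql
-- MySQL < 5.7.4 only: duplicate rows are discarded while the index is built
ALTER IGNORE TABLE old_table
  ADD UNIQUE INDEX dedup_idx (city, post_code, short_code);
```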
In my experience, once your table grows to millions of records and more, the most effective way to handle duplicates is to: 1) export the data to text files, 2) sort the files, 3) remove duplicates in the files, 4) load the result back into the database.
As the data grows, this approach eventually becomes faster than any SQL query you may invent.
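The four steps above can be sketched like this (the file paths and the TRUNCATE are illustrative; INTO OUTFILE writes on the database server and requires the FILE privilege):

```sql
-- 1) export data to a text file (server-side)
SELECT * FROM old_table INTO OUTFILE '/tmp/old_table.txt';

-- 2) + 3) sort and de-duplicate outside the database, e.g. at the shell:
--        sort -u /tmp/old_table.txt > /tmp/old_table.dedup.txt

-- 4) load the de-duplicated file back
TRUNCATE TABLE old_table;
LOAD DATA INFILE '/tmp/old_table.dedup.txt' INTO TABLE old_table;
```

The external sort is what makes this scale: `sort` works in bounded memory with temporary files, whereas a big GROUP BY or self-join may not.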
To avoid the memory issue, avoid the big select by using a small external program with the logic below. First, back up your database. Then:
do {
    # find a record
    x = sql: SELECT * FROM table1 LIMIT 1;
    if x is null then
        exit  # no more data in table1
    end if
    # copy the record to the de-duplicated table
    sql: INSERT INTO table2 VALUES (x);
    # find the value of the field that should NOT be duplicated
    a = parse(x for table1.a)
    # delete all such entries from table1
    sql: DELETE FROM table1 WHERE a = '$a';
}
You don't need to group data. Try this:
DELETE FROM old_table
USING old_table, old_table AS vtable
WHERE old_table.id > vtable.id
  AND old_table.city = vtable.city
  AND old_table.post_code = vtable.post_code
  AND old_table.short_code = vtable.short_code;
I can't comment on posts because of my points ... First run repair table old_table; next, show:
EXPLAIN SELECT old_table.id FROM old_table, old_table as vtable
WHERE (old_table.id > vtable.id)
AND (old_table.city=vtable.city AND
old_table.post_code=vtable.post_code
AND old_table.short_code=vtable.short_code);
Show: os~> ulimit -a; mysql> SHOW VARIABLES LIKE 'open_files_limit';
Next: remove any OS restrictions from the mysql process, e.g. raise the open-files limit with
ulimit -n 1024 etc.
MySQL has an INSERT IGNORE. From the docs:
[...] however, when INSERT IGNORE is used, the insert operation fails silently for the row containing the unmatched value, but any rows that are matched are inserted.
So you could use your query from above by just adding IGNORE:
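For instance (assuming new_table already carries a unique index over the columns that must be unique, so duplicate rows are silently skipped rather than aborting the insert):

```sql
INSERT IGNORE INTO new_table
SELECT * FROM old_table;
```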
This will populate NEW_TABLE with unique values, where the id value is the first id of each bunch:
INSERT INTO NEW_TABLE
SELECT MIN(ot.id),
ot.city,
ot.post_code,
ot.short_code
FROM OLD_TABLE ot
GROUP BY ot.city, ot.post_code, ot.short_code
If you want the highest id value per bunch:
INSERT INTO NEW_TABLE
SELECT MAX(ot.id),
ot.city,
ot.post_code,
ot.short_code
FROM OLD_TABLE ot
GROUP BY ot.city, ot.post_code, ot.short_code