Deleting duplicate rows from Redshift

Asked by 南方客 on 2020-12-31 02:45

I am trying to delete some duplicate data in my redshift table.

Below is my query:

With duplicates
As
(Select *, ROW_NUMBER() Over (PARTITION by record_indicator
                              Order by record_indicator) as Duplicate
From table_name)
delete from table_name
where id in (select id from duplicates Where Duplicate > 1);
7 Answers
  • 2020-12-31 02:52

    The following deletes all records in 'tablename' that have a duplicate; it will not deduplicate the table. Because Redshift does not enforce uniqueness, exact duplicates share the same id, so the IN list below matches every copy of a duplicated row, including the one with rnum = 1:

    DELETE FROM tablename
    WHERE id IN (
        SELECT id
        FROM (
              SELECT id,
                     ROW_NUMBER() OVER (PARTITION BY column1, column2, column3 ORDER BY id) AS rnum
              FROM tablename
             ) t
        WHERE t.rnum > 1);
    

    Postgres administrative snippets

  • 2020-12-31 02:54

    Redshift being what it is (no enforced uniqueness for any column), Ziggy's 3rd option is probably best. Once we decide to go the temp-table route, it is more efficient to swap things out whole. Deletes and inserts are expensive in Redshift.

    begin;
    create table table_name_new as select distinct * from table_name;
    alter table table_name rename to table_name_old;
    alter table table_name_new rename to table_name;
    drop table table_name_old;
    commit;
    

    If space isn't an issue you can keep the old table around for a while (skip the final drop) and use the other methods described here to validate that the row count of the original, after accounting for duplicates, matches the row count of the new table.
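
    A sketch of that validation (assuming you skipped the drop, so the original survives as table_name_old): the distinct-row count of the old table should equal the total row count of the new one.

    select
      (select count(*) from (select distinct * from table_name_old) d) as old_distinct,
      (select count(*) from table_name) as new_total;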

    If you're doing constant loads to such a table you'll want to pause that process while this is going on.

    If the number of duplicates is a small percentage of a large table, you might want to try copying distinct records of the duplicates to a temp table, then delete all records from the original that join with the temp. Then append the temp table back to the original. Make sure you vacuum the original table after (which you should be doing for large tables on a schedule anyway).
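
    A minimal sketch of that sequence, assuming col1, col2, col3 are the (hypothetical) columns that define a duplicate:

    -- one distinct copy of every row that has duplicates
    create temp table dup_rows as
      select distinct t.*
      from table_name t
      join (select col1, col2, col3
            from table_name
            group by col1, col2, col3
            having count(*) > 1) d
        on t.col1 = d.col1 and t.col2 = d.col2 and t.col3 = d.col3;

    -- remove every copy from the original, then put one copy back
    delete from table_name
    using dup_rows d
    where table_name.col1 = d.col1
      and table_name.col2 = d.col2
      and table_name.col3 = d.col3;

    insert into table_name select * from dup_rows;

    -- reclaim the space left by the delete
    vacuum table_name;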

  • 2020-12-31 02:55

    This method preserves the permissions and table definition of the original_table.

    1. Create a table with the unique rows
    CREATE TABLE unique_table as
    (
       SELECT DISTINCT * FROM original_table
    )
    ;
    
    2. Back up the original_table
    CREATE TABLE backup_table as
    (
       SELECT * FROM original_table
    )
    ;
    
    3. Truncate the original_table
    TRUNCATE original_table;
    
    4. Insert records from unique_table into original_table
    INSERT INTO original_table
    (
    SELECT * FROM unique_table
    )
    ;
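
    One caveat to the sequence above: in Redshift, TRUNCATE implicitly commits the current transaction, so these steps cannot be made atomic with BEGIN/COMMIT as written. A sketch of a transactional variant (using a hypothetical temp table name), at the cost of a slower delete:

    BEGIN;
    CREATE TEMP TABLE unique_rows as (SELECT DISTINCT * FROM original_table);
    DELETE FROM original_table;  -- unlike TRUNCATE, DELETE is transactional
    INSERT INTO original_table (SELECT * FROM unique_rows);
    COMMIT;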
    
  • 2020-12-31 03:02

    Simple answer to this question:

    1. First, create a temporary table from the main table, keeping only the rows where row_number = 1.
    2. Second, delete all the rows from the main table on which we had duplicates.
    3. Then insert the rows from the temporary table back into the main table.

    Queries:

    1. Temporary table

      select * into #temp_a
      from (select a.*,
                   row_number() over (partition by id order by etl_createdon desc) as rn
            from table a
            where a.id between 59 and 75
              and a.date = '2018-05-24') a
      where rn = 1;
      -- drop the helper column so step 3 can insert with select *
      alter table #temp_a drop column rn;

    2. Deleting all the affected rows from the main table

      delete from table where id between 59 and 75 and date = '2018-05-24';

    3. Inserting all rows from the temp table back into the main table

      insert into table select * from #temp_a;

  • 2020-12-31 03:03

    That should have worked. Alternatively you can do:

    With 
      duplicates As (
        Select *, ROW_NUMBER() Over (PARTITION by record_indicator
                                     Order by record_indicator) as Duplicate
        From table_name)
    delete from table_name
    where id in (select id from duplicates Where Duplicate > 1);
    

    or

    delete from table_name
    where id in (
      select id
      from (
        Select id, ROW_NUMBER() Over (PARTITION by record_indicator
                                     Order by record_indicator) as Duplicate
        From table_name) x
      Where Duplicate > 1);
    

    If you have no primary key, you can do the following:

    -- Redshift supports neither DISTINCT ON nor ON COMMIT DROP, so emulate
    -- DISTINCT ON with ROW_NUMBER() and drop the helper column afterwards.
    CREATE TEMP TABLE mydups AS
      SELECT * FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY record_indicator
                                     ORDER BY record_indicator
                                     --, other_optional_priority_field DESC
                                     ) AS rn
        FROM table_name) t
      WHERE rn = 1;
    ALTER TABLE mydups DROP COLUMN rn;  -- so SELECT * lines up with table_name
    
    BEGIN;
    DELETE FROM table_name
    WHERE record_indicator IN (
      SELECT record_indicator FROM mydups);
    
    INSERT INTO table_name SELECT * FROM mydups;
    COMMIT;
    
  • 2020-12-31 03:04

    Your query does not work because Redshift does not allow DELETE after the WITH clause. Only SELECT, UPDATE, and a few others are allowed (see the WITH clause documentation).

    Solution (in my situation):

    My table events has an id column that should uniquely identify a record, but the table contained duplicate rows. This column id is the same as your record_indicator.

    Unfortunately I was unable to create a temporary table because I ran into the following error using SELECT DISTINCT:

    ERROR: Intermediate result row exceeds database block size

    But this worked like a charm:

    CREATE TABLE temp as (
        SELECT *,ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rownumber 
        FROM events
    );
    

    resulting in the temp table:

    id | rownumber | ...
    ---+-----------+----
    1  | 1         | ...
    1  | 2         | ...
    2  | 1         | ...
    2  | 2         | ...
    

    Now the duplicates can be deleted by removing the rows having rownumber larger than 1:

    DELETE FROM temp WHERE rownumber > 1;
    

    After that, rename the tables and you're done.
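
    For completeness, that last step might look like this (a sketch using the names above; note the helper rownumber column has to be dropped first, or it survives the rename):

    ALTER TABLE temp DROP COLUMN rownumber;  -- remove the helper column
    ALTER TABLE events RENAME TO events_old;
    ALTER TABLE temp RENAME TO events;
    DROP TABLE events_old;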
