Copy data from Amazon S3 to Redshift and avoid duplicate rows

深忆病人 2021-02-01 10:03

I am copying data from Amazon S3 to Redshift. During this process, I need to avoid loading the same files again. I don't have any unique constraints on my Redshift table.

4 Answers
  •  盖世英雄少女心
    2021-02-01 10:31

    As user1045047 mentioned, Amazon Redshift doesn't enforce unique constraints, so I looked for a way to delete duplicate records from a table with a DELETE statement. Eventually I found a reasonable one.

    Amazon Redshift supports creating an IDENTITY column, which stores an auto-generated unique number. http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html

    The following SQL deletes duplicated records in PostgreSQL by using OID, which is a unique column. You can use the same SQL on Redshift by replacing OID with the identity column.

    DELETE FROM duplicated_table WHERE OID > (
      SELECT MIN(OID) FROM duplicated_table d2
      WHERE duplicated_table.column1 = d2.column1
      AND duplicated_table.column2 = d2.column2
    );
    

    Here is an example that I tested on my Amazon Redshift cluster.

    create table auto_id_table (auto_id int IDENTITY, name varchar, age int);
    
    insert into auto_id_table (name, age) values('John', 18);
    insert into auto_id_table (name, age) values('John', 18);
    insert into auto_id_table (name, age) values('John', 18);
    insert into auto_id_table (name, age) values('John', 18);
    insert into auto_id_table (name, age) values('John', 18);
    insert into auto_id_table (name, age) values('Bob', 20);
    insert into auto_id_table (name, age) values('Bob', 20);  
    insert into auto_id_table (name, age) values('Matt', 24); 
    
    select * from auto_id_table order by auto_id; 
     auto_id | name | age 
    ---------+------+-----
           1 | John |  18
           2 | John |  18
           3 | John |  18
           4 | John |  18
           5 | John |  18
           6 | Bob  |  20
           7 | Bob  |  20
           8 | Matt |  24    
    (8 rows) 
    
    delete from auto_id_table where auto_id > (
      select min(auto_id) from auto_id_table d
        where auto_id_table.name = d.name
        and auto_id_table.age = d.age
    );
    
    select * from auto_id_table order by auto_id;
     auto_id | name | age 
    ---------+------+-----
           1 | John |  18
           6 | Bob  |  20
           8 | Matt |  24
    (3 rows)
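
    The delete pattern above can be generated for any list of columns. The following is a minimal sketch, not a Redshift-tested implementation: build_dedupe_sql and demo are hypothetical helper names, and the demo replays the example data on an in-memory SQLite database, with AUTOINCREMENT standing in for the IDENTITY column.

```python
import sqlite3


def build_dedupe_sql(table, id_column, key_columns):
    """Return a DELETE that keeps the lowest id in each group of duplicates.

    Names are interpolated directly, so pass only trusted identifiers.
    """
    if not key_columns:
        raise ValueError("key_columns must not be empty")
    # Correlate the outer row with the aliased copy d2 on every key column.
    match = " AND ".join(f"{table}.{col} = d2.{col}" for col in key_columns)
    return (
        f"DELETE FROM {table} WHERE {id_column} > ("
        f"SELECT MIN({id_column}) FROM {table} d2 WHERE {match})"
    )


def demo():
    """Replay the example data on an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    # Assumption: AUTOINCREMENT stands in for Redshift's IDENTITY column.
    conn.execute(
        "CREATE TABLE auto_id_table ("
        "auto_id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, age INTEGER)"
    )
    rows = [("John", 18)] * 5 + [("Bob", 20)] * 2 + [("Matt", 24)]
    conn.executemany("INSERT INTO auto_id_table (name, age) VALUES (?, ?)", rows)
    conn.execute(build_dedupe_sql("auto_id_table", "auto_id", ["name", "age"]))
    remaining = conn.execute(
        "SELECT auto_id, name, age FROM auto_id_table ORDER BY auto_id"
    ).fetchall()
    conn.close()
    return remaining


if __name__ == "__main__":
    for row in demo():
        print(row)
```

    On SQLite this leaves the rows with auto_id 1, 6 and 8, matching the result above. Note that rows with NULL in a compared column are never treated as duplicates, because NULL = NULL does not evaluate to true.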
    

    It also works with the COPY command, like this.

    • auto_id_table.csv

      John,18
      Bob,20
      Matt,24
      
    • copy sql

      copy auto_id_table (name, age) from '[s3-path]/auto_id_table.csv' CREDENTIALS 'aws_access_key_id=[your-aws-key-id];aws_secret_access_key=[your-aws-secret-key]' delimiter ',';
      

    The advantage of this approach is that you don't need to run DDL statements. However, it doesn't work with existing tables that lack an identity column, because an identity column cannot be added to an existing table. For such tables, the only way to delete duplicated records is to migrate all records like this (same as user1045047's answer).

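    -- create an empty copy of the table first, then fill it with the distinct rows
    create table temp_table (like original_table);
    insert into temp_table (select distinct * from original_table);
    drop table original_table;
    alter table temp_table rename to original_table;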
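
    The migration steps can be tried locally before running them on a cluster. This is a rough sketch on an in-memory SQLite database, not Redshift. rebuild_without_duplicates and demo are hypothetical names, and CREATE TABLE ... AS SELECT is swapped in to create and fill the copy in one step, instead of inserting into a temp table that already exists.

```python
import sqlite3


def rebuild_without_duplicates(conn, table):
    """Replace `table` with a copy that holds only its distinct rows.

    `table` is interpolated directly, so pass only a trusted identifier.
    """
    temp = f"{table}_dedup"  # hypothetical name for the temporary copy
    # Swapped-in technique: CTAS creates and fills the copy in one statement.
    conn.execute(f"CREATE TABLE {temp} AS SELECT DISTINCT * FROM {table}")
    conn.execute(f"DROP TABLE {table}")
    conn.execute(f"ALTER TABLE {temp} RENAME TO {table}")
    conn.commit()


def demo():
    """Load duplicated rows, rebuild the table, and return what is left."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE original_table (name TEXT, age INTEGER)")
    rows = [("John", 18)] * 5 + [("Bob", 20)] * 2 + [("Matt", 24)]
    conn.executemany("INSERT INTO original_table VALUES (?, ?)", rows)
    rebuild_without_duplicates(conn, "original_table")
    result = conn.execute(
        "SELECT name, age FROM original_table ORDER BY name"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(demo())
```

    Unlike the identity-column delete, this rewrites the whole table, so it is likely to be slower on large tables.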
    
