Copy data from Amazon S3 to Redshift and avoid duplicate rows

Asked by 深忆病人 on 2021-02-01 10:03

I am copying data from Amazon S3 to Redshift. During this process, I need to avoid the same files being loaded again. I don't have any unique constraints on my Redshift table.
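
For reference, a minimal sketch of the kind of load I mean (the bucket path, IAM role, and table name below are placeholders, not my actual values):

    COPY my_table
    FROM 's3://my-bucket/incoming/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS CSV;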

4 Answers
  •  无人及你
    2021-02-01 10:32

    Currently there is no built-in way to keep duplicates out of Redshift at load time. Redshift accepts primary key and unique key declarations but does not enforce them, so duplicate rows load without error. Removing the duplicates in place is also awkward: although ROW_NUMBER() is available as a window function, Redshift's DELETE statement does not lend itself to the complex, window-based predicate (delete every row whose row number is greater than 1) you would use to target only the duplicates.
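
    To illustrate the unenforced constraint (a minimal sketch; the table and column names are invented for this example), a declared primary key does not reject a duplicate load:

    Create table events (
        event_id bigint primary key,   -- declared, but not enforced by Redshift
        payload  varchar(256)
    );

    -- Both inserts succeed; the table ends up with two rows for event_id = 1.
    Insert into events values (1, 'first copy');
    Insert into events values (1, 'duplicate copy');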

    The most practical way to remove duplicates is to write a cron/Quartz job that selects all the distinct rows into a separate table, drops the original table, and renames the new table to the original name:

    Create table temp_originalTable (like originalTable)

    Insert into temp_originalTable (Select Distinct * from originalTable)

    Drop table originalTable

    Alter table temp_originalTable rename to originalTable
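
    If queries may run against the table while the swap happens, the same steps can be wrapped in a single transaction so the rename appears atomic to readers (a sketch assuming the table names above and that your cluster allows these DDL statements inside a transaction block):

    Begin;
    Create table temp_originalTable (like originalTable);
    Insert into temp_originalTable (Select Distinct * from originalTable);
    Drop table originalTable;
    Alter table temp_originalTable rename to originalTable;
    End;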
