I am copying data from Amazon S3 to Redshift. During this process, I need to avoid the same files being loaded again. I don't have any unique constraints on my Redshift table.
Currently there is no built-in way to remove duplicates in Redshift. Redshift lets you declare primary key/unique key constraints but does not enforce them, and deleting duplicates by row number (deleting every row whose row number is greater than 1) is not straightforward either, because Redshift's DELETE statement does not accept window functions such as ROW_NUMBER() in its WHERE clause.
A practical workaround is to run a scheduled cron/Quartz job that copies all the distinct rows into a separate table, drops the original table, and renames the new table to the original name:
CREATE TABLE temp_originalTable (LIKE originalTable);
INSERT INTO temp_originalTable (SELECT DISTINCT * FROM originalTable);
DROP TABLE originalTable;
ALTER TABLE temp_originalTable RENAME TO originalTable;
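If it helps, here is a minimal sketch of what such a scheduled job could look like in Python, run from cron. It assumes the psycopg2 driver and placeholder connection settings (host, database, user, password), and reuses the table names from the statements above; treat it as an outline rather than a drop-in script.

# dedupe_redshift.py: minimal sketch of a cron-driven dedupe job.
# Assumes psycopg2 and placeholder connection details. Redshift runs these
# DDL statements inside a transaction, so the table swap commits or rolls
# back as one unit.
import psycopg2

DEDUPE_STATEMENTS = [
    "CREATE TABLE temp_originalTable (LIKE originalTable);",
    "INSERT INTO temp_originalTable (SELECT DISTINCT * FROM originalTable);",
    "DROP TABLE originalTable;",
    "ALTER TABLE temp_originalTable RENAME TO originalTable;",
]

def dedupe():
    # Connection parameters are placeholders: point them at your cluster.
    conn = psycopg2.connect(
        host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="mydb",
        user="myuser",
        password="mypassword",
    )
    try:
        with conn.cursor() as cur:
            for statement in DEDUPE_STATEMENTS:
                cur.execute(statement)
        conn.commit()  # the swap becomes visible only if every step succeeded
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()

if __name__ == "__main__":
    dedupe()

Because psycopg2 wraps the statements in a single transaction until commit() is called, a failure at any step leaves the original table untouched.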