Deleting duplicate rows from Redshift

南方客 2020-12-31 02:45

I am trying to delete some duplicate data from my Redshift table.

Below is my query:

With duplicates
As
(Select *, ROW_NUMBER() Over (PARTITION by rec         


        
7 answers
  • 2020-12-31 03:15

    If you're dealing with a lot of data, it's not always possible or smart to recreate the whole table. It may be easier to locate and delete just the duplicated rows, then insert a single copy of each back:

    -- Open a transaction so the closing COMMIT applies all steps together
    BEGIN;
    
    -- First identify all the rows that are duplicated
    CREATE TEMP TABLE duplicate_saleids AS
    SELECT saleid
    FROM sales
    WHERE saledateid BETWEEN 2224 AND 2231
    GROUP BY saleid
    HAVING COUNT(*) > 1;
    
    -- Extract one copy of all the duplicate rows
    CREATE TEMP TABLE new_sales(LIKE sales);
    
    INSERT INTO new_sales
    SELECT DISTINCT *
    FROM sales
    WHERE saledateid BETWEEN 2224 AND 2231
    AND saleid IN(
         SELECT saleid
         FROM duplicate_saleids
    );
    
    -- Remove all rows that were duplicated (all copies).
    DELETE FROM sales
    WHERE saledateid BETWEEN 2224 AND 2231
    AND saleid IN(
         SELECT saleid
         FROM duplicate_saleids
    );
    
    -- Insert back in the single copies
    INSERT INTO sales
    SELECT *
    FROM new_sales;
    
    -- Cleanup
    DROP TABLE duplicate_saleids;
    DROP TABLE new_sales;
    
    COMMIT;
    

    Full article: https://elliot.land/post/removing-duplicate-data-in-redshift
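
    Once the statements have run, it may be worth confirming that nothing slipped through. The query below is a minimal sketch that reuses the same table, columns, and saledateid range as the steps above; it should return no rows if every saleid in that range is now unique. Keep in mind that SELECT DISTINCT * only collapses rows that match in every column, so copies that differ in any column would survive and still show up here.

    ```sql
    -- Sanity check (sketch): list any saleid that still appears more than once
    -- in the range that was cleaned. An empty result means no duplicates remain.
    SELECT saleid, COUNT(*) AS copies
    FROM sales
    WHERE saledateid BETWEEN 2224 AND 2231
    GROUP BY saleid
    HAVING COUNT(*) > 1
    ORDER BY copies DESC;
    ```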
