BigQuery: Deleting Duplicates in Partitioned Table

问题

I have BQ table that is partitioned by insert time. I'm trying to remove duplicates from the table. These are true duplicates: for 2 duplicate rows, all columns are equal - of course having a unique key might have been helpful :-(

At first I tried a SELECT query to enumerate duplicates and remove them:

SELECT
    * EXCEPT(row_number)
FROM (
    SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id_column) row_number
    FROM
    `mytable`)
WHERE
    row_number = 1

This results in unique rows but creates a new table that doesn't include the partition data - so not good.

I've seen this answer here which states the only way to retain partitions is to go over them one-by-one with the above query and save to a specific target table partition.

What I'd really want to do is use a DML DELETE to remove the duplicate rows in place. I tried something similar to what this answer suggested:

DELETE
FROM `mytable` AS d
WHERE (SELECT ROW_NUMBER() OVER (PARTITION BY id_column)
   FROM `mytable ` AS d2
   WHERE d.id = d2.id) > 1;

But the accepted answer doesn't work and results in a BQ error:

Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN

Would be great if anyone could offer a simpler (DML or otherwise) way to deal with this so I won't be required to loop over all partitions individually.

回答1:

Kind of a hack, but you can use the MERGE statement to delete all of the contents of the table and reinsert only distinct rows atomically. Here's an example:

-- Create a table with some duplicate rows
CREATE TABLE dataset.PartitionedTable
PARTITION BY date AS
SELECT x, CONCAT('foo', CAST(x AS STRING)) AS y, DATE_SUB(CURRENT_DATE(), INTERVAL x DAY) AS date
FROM UNNEST(GENERATE_ARRAY(1, 10)) AS x, UNNEST(GENERATE_ARRAY(1, 10));

Now for the MERGE part:

-- Execute a MERGE statement where all original rows are deleted,
-- then replaced with new, deduplicated rows:
MERGE dataset.PartitionedTable AS t1
USING (SELECT DISTINCT * FROM dataset.PartitionedTable) AS t2
ON FALSE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
WHEN NOT MATCHED BY SOURCE THEN DELETE

来源：https://stackoverflow.com/questions/53650596/bigquery-deleting-duplicates-in-partitioned-table

标签

google-bigquery

bigquery-standard-sql