Question
We use BigQuery heavily and have two tables that were updated in parallel by different processes. The problem is that neither table has a single-column unique identifier: the unique key is two columns combined (dsn and timestamp). The goal is to combine the two tables with zero duplication, if possible.
I've tried various MySQL-based queries, but none seem to work in BigQuery. So I am posting here for some assistance. :)
Step 1. Copy the "clean" table into a new merged table.
Step 2. Query the "dirty" (old) table and insert any missing entries.
Query Attempt 1:
SELECT
COUNT(c.*)
FROM
[flash-student-96619:device_data.device_datav3_20160530] AS old
WHERE NOT EXISTS (
SELECT
1
FROM
[flash-student-96619:device_data_v7_merged.20160530] AS new
WHERE
new.dsn = old.dsn
AND new.timestamp = old.timestamp
)
Error: error at: 6.1 - 10.65. Only one query can be executed at a time.
Query Attempt 2:
SELECT
*
FROM
[flash-student-96619:device_data.device_datav3_20160530]
WHERE
(dsn, timestamp) NOT IN (
SELECT
dsn,
timestamp
FROM
[flash-student-96619:device_data_v7_merged.20160530]
)
Error: Encountered " "," ", "" at line 6, column 7. Was expecting: ")" ...
Honestly, if I could do this in one query I would be happy. I need to fetch from two tables, and make a new one with unique data.
Any assistance?
Answer 1:
Something like the query below should work:
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY dsn, timestamp) AS dup
FROM
[flash-student-96619:device_data.device_datav3_20160530],
[flash-student-96619:device_data_v7_merged.20160530]
)
WHERE dup = 1
I recommend using an explicit list of fields instead of * in the outer SELECT, so that you can omit dup from the actual output.
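In legacy SQL the comma between the two tables in FROM acts as UNION ALL, so the query above numbers the rows within each (dsn, timestamp) group across both tables and keeps one row per group. To check that logic without touching BigQuery, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in (it needs SQLite 3.25 or newer for window functions). The table names old_data and new_data, the value column, and the priority column are all made up for illustration. The ORDER BY priority is an addition that is not in the query above; it is there so that the row from the clean table wins when both tables share a key.

```python
import sqlite3

# Hypothetical stand-in data: two tables keyed by (dsn, timestamp).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE old_data (dsn TEXT, timestamp TEXT, value INTEGER);
    CREATE TABLE new_data (dsn TEXT, timestamp TEXT, value INTEGER);
    INSERT INTO old_data VALUES
        ('A', '2016-05-30 00:00:00', 1),
        ('B', '2016-05-30 00:00:00', 2),
        ('C', '2016-05-30 01:00:00', 3);
    INSERT INTO new_data VALUES
        ('A', '2016-05-30 00:00:00', 10),
        ('D', '2016-05-30 02:00:00', 40);
""")

DEDUP_QUERY = """
    SELECT dsn, timestamp, value           -- explicit list, so dup is left out
    FROM (
        SELECT dsn, timestamp, value,
               ROW_NUMBER() OVER (
                   PARTITION BY dsn, timestamp
                   ORDER BY priority        -- assumption: prefer the clean table
               ) AS dup
        FROM (
            SELECT dsn, timestamp, value, 0 AS priority FROM new_data
            UNION ALL
            SELECT dsn, timestamp, value, 1 AS priority FROM old_data
        )
    )
    WHERE dup = 1
    ORDER BY dsn, timestamp
"""

# One row per (dsn, timestamp); key 'A' exists in both tables.
merged = conn.execute(DEDUP_QUERY).fetchall()
for row in merged:
    print(row)
```

Without an ORDER BY inside the window, which of two duplicate rows survives is arbitrary. That is fine when duplicates are identical in every column, but worth pinning down when they can differ.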
Answer 2:
A bit late, but I wanted to point out that your original query works with minor modifications using standard SQL (uncheck the "Use Legacy SQL" box under "Show Options"). I just had to change the alias new to something else, since that's a reserved keyword. The example also uses COUNT(*) rather than COUNT(c.*), since no table in the original query is aliased c. For example, this query is valid:
WITH OldData AS (
SELECT
x AS dsn,
TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL x HOUR) AS timestamp
FROM UNNEST([1, 2, 3, 4]) AS x),
NewData AS (
SELECT
x AS dsn,
TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL x HOUR) AS timestamp
FROM UNNEST([5, 2, 1, 6]) AS x)
SELECT
COUNT(*)
FROM OldData oldData
WHERE NOT EXISTS (
SELECT 1
FROM NewData newData
WHERE
newData.dsn = oldData.dsn
AND newData.timestamp = oldData.timestamp
);
+-----+
| f0_ |
+-----+
|   2 |
+-----+
As for your second attempt, you can wrap the two columns in a STRUCT on both sides of NOT IN:
WITH OldData AS (
SELECT
x AS dsn,
TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL x HOUR) AS timestamp
FROM UNNEST([1, 2, 3, 4]) AS x),
NewData AS (
SELECT
x AS dsn,
TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL x HOUR) AS timestamp
FROM UNNEST([5, 2, 1, 6]) AS x)
SELECT
*
FROM OldData
WHERE
STRUCT(dsn, timestamp) NOT IN (
SELECT AS STRUCT
dsn,
timestamp
FROM NewData);
+-----+---------------------+
| dsn |      timestamp      |
+-----+---------------------+
|   3 | 2016-07-21 11:54:08 |
|   4 | 2016-07-21 10:54:08 |
+-----+---------------------+
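The same NOT EXISTS anti-join also covers the two-step plan from the question: copy the clean table, then insert only the rows from the old table whose (dsn, timestamp) key is missing. Below is a minimal sketch of that plan, again using Python's built-in sqlite3 module as a local stand-in rather than BigQuery itself. The table names clean_data, dirty_data, and merged and the sample timestamps are hypothetical; the dsn values mirror the [1, 2, 3, 4] and [5, 2, 1, 6] arrays used above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-in tables; only the two key columns are modeled.
    CREATE TABLE clean_data (dsn INTEGER, timestamp TEXT);
    CREATE TABLE dirty_data (dsn INTEGER, timestamp TEXT);
    INSERT INTO clean_data VALUES
        (5, '2016-07-21 09:00:00'), (2, '2016-07-21 12:00:00'),
        (1, '2016-07-21 13:00:00'), (6, '2016-07-21 08:00:00');
    INSERT INTO dirty_data VALUES
        (1, '2016-07-21 13:00:00'), (2, '2016-07-21 12:00:00'),
        (3, '2016-07-21 11:00:00'), (4, '2016-07-21 10:00:00');

    -- Step 1: copy the clean table into a new merged table.
    CREATE TABLE merged AS SELECT * FROM clean_data;

    -- Step 2: add rows from the dirty table whose key is not in the clean one.
    -- DISTINCT drops exact repeats inside dirty_data itself.
    INSERT INTO merged
    SELECT DISTINCT d.dsn, d.timestamp
    FROM dirty_data AS d
    WHERE NOT EXISTS (
        SELECT 1
        FROM clean_data AS c
        WHERE c.dsn = d.dsn
          AND c.timestamp = d.timestamp
    );
""")

rows = conn.execute(
    "SELECT dsn, timestamp FROM merged ORDER BY dsn"
).fetchall()
for row in rows:
    print(row)
```

Note that DISTINCT compares whole rows, so with extra non-key columns it would not collapse two dirty rows that share a key but differ elsewhere; a ROW_NUMBER filter per key would be needed in that case.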
Source: https://stackoverflow.com/questions/38446499/bigquery-deduplication-on-two-columns-as-unique-key