问题
I am facing an issue related to a project of mine. Here is the summary of what i would like to do :
I have a big daily file (100 Go) with the following extract (no header) :
ID_A|segment_1
ID_A|segment_2
ID_B|segment_2
ID_B|segment_3
ID_B|segment_4
ID_B|segment_5
ID_C|segment_1
ID_D|segment_2
ID_D|segment_4
Every ID (from A to D) can be linked to one or multiple segments (from 1 to 5).
I would like to process this file in order to have the following result (the result file contains a header) :
ID|segment_1|segment_2|segment_3|segment_4|segment_5
ID_A|1|1|0|0|0
ID_B|0|1|1|1|1
ID_C|1|0|0|0|0
ID_D|0|1|0|1|0
1 means that the ID is included in the segment, 0 means that it is not.
I am using the following query to get the result :
select id,
countif(segment = 'segment_1') as segment_1,
countif(segment = 'segment_2') as segment_2,
countif(segment = 'segment_3') as segment_3,
countif(segment = 'segment_4') as segment_4,
countif(segment = 'segment_5') as segment_5
from staging s cross join
unnest(split(segments, ',')) as segment
group by id;
This solution worked for me until the number of segments became a lot higher (900+ segments instead of 5 in my first example). This is creating a huge query that cannot be passed as an argument through bq
cli.
Is there any workaround that i can use ?
Thanks to all for your help.
Regards
回答1:
Below is for BigQuery Standard SQL
EXECUTE IMMEDIATE '''
SELECT id, ''' || (
SELECT STRING_AGG("COUNTIF(segment = '" || segment || "') AS " || segment ORDER BY segment)
FROM (SELECT DISTINCT segment FROM staging)
) || '''
FROM staging
GROUP BY 1
ORDER BY 1
'''
If to apply to sample data in your question - output is
Row id segment_1 segment_2 segment_3 segment_4 segment_5
1 ID_A 1 1 0 0 0
2 ID_B 0 1 1 1 1
3 ID_C 1 0 0 0 0
4 ID_D 0 1 0 1 0
and as you can see you don't need to worry about number and naming of segments - it is taken care of by above query
来源:https://stackoverflow.com/questions/63105338/bigquery-use-conditions-to-create-a-table-from-other-tables-manage-big-number-o