How to scale Pivoting in BigQuery?

名媛妹妹 2020-11-27 23:45

Let's say I have a table of music video play stats, mydataset.stats, for a given day (3B rows, 1M users, 6K artists). The simplified schema is: UserGUID String, ArtistGUID String

1 answer
  • 2020-11-28 00:04

    I tried the approach below with up to 6,000 features and it worked as expected. I believe it will work up to 10K features, which is the hard limit on the number of columns in a table.

    STEP 1 - Aggregate plays by user / artist

    SELECT userGUID as uid, artistGUID as aid, COUNT(1) as plays 
    FROM [mydataset.stats] GROUP BY 1, 2
    

    STEP 2 – Normalize uid and aid so that they are consecutive numbers 1, 2, 3, … .
    We need this for at least two reasons: a) to keep the dynamically generated SQL in later steps as compact as possible, and b) to get more usable, friendlier column names.

    Combined with the first step, it becomes:

    SELECT u.uid AS uid, a.aid AS aid, plays 
    FROM (
      SELECT userGUID, artistGUID, COUNT(1) AS plays 
      FROM [mydataset.stats] 
      GROUP BY 1, 2
    ) AS s
    JOIN (
      SELECT userGUID, ROW_NUMBER() OVER() AS uid FROM [mydataset.stats] GROUP BY 1
    ) AS u ON u.userGUID = s.userGUID
    JOIN (
      SELECT artistGUID, ROW_NUMBER() OVER() AS aid FROM [mydataset.stats] GROUP BY 1
    ) AS a ON a.artistGUID = s.artistGUID 
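
To make the normalization concrete, here is a minimal in-memory sketch in Python of what steps 1 and 2 compute together. The sample rows and the helper name `assign_ids` are hypothetical and only for illustration. Note one difference: this sketch numbers GUIDs in first-seen order, whereas `ROW_NUMBER() OVER()` without an ordering assigns numbers in an arbitrary order.

```python
def assign_ids(guids):
    """Map each distinct GUID to a consecutive integer starting at 1."""
    ids = {}
    for g in guids:
        if g not in ids:
            ids[g] = len(ids) + 1
    return ids


# Hypothetical sample rows: (userGUID, artistGUID), one row per play.
plays = [
    ("u-aa", "art-x"),
    ("u-aa", "art-x"),
    ("u-bb", "art-y"),
    ("u-aa", "art-y"),
]

user_ids = assign_ids(u for u, _ in plays)
artist_ids = assign_ids(a for _, a in plays)

# Count plays per (uid, aid), mirroring COUNT(1) ... GROUP BY 1, 2.
aggs = {}
for u, a in plays:
    key = (user_ids[u], artist_ids[a])
    aggs[key] = aggs.get(key, 0) + 1
```

Each key of `aggs` corresponds to one (uid, aid) row of the step 2 result, and the value is its plays count.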
    

    Write the output to the table mydataset.aggs.

    STEP 3 – Apply the pivoting approach already suggested in the questions mentioned above, for N features (artists) at a time. In my particular example, I found by experimenting that the basic approach works well for between 2000 and 3000 features. To be on the safe side, I decided to use 2000 features at a time.

    The script below dynamically generates a query, which is then run to create the partitioned tables:

    SELECT 'SELECT uid,' + 
       GROUP_CONCAT_UNQUOTED(
          'SUM(IF(aid=' + STRING(aid) + ',plays,NULL)) as a' + STRING(aid) 
       ) 
       + ' FROM [mydataset.aggs] GROUP EACH BY uid'
    FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid > 0 and aid < 2001)
    

    The query above produces yet another query, like the one below:

    SELECT uid,SUM(IF(aid=1,plays,NULL)) as a1,SUM(IF(aid=3,plays,NULL)) as a3,
      SUM(IF(aid=2,plays,NULL)) as a2,SUM(IF(aid=4,plays,NULL)) as a4 . . .
    FROM [mydataset.aggs] GROUP EACH BY uid 
    

    Run this query and write the result to mydataset.pivot_1_2000.
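
The same query text could also be assembled client-side instead of with GROUP_CONCAT_UNQUOTED. The following is a rough Python sketch under that assumption; the function name `build_pivot_query` is hypothetical, and it only builds the string, it does not submit anything to BigQuery.

```python
def build_pivot_query(aids, source="[mydataset.aggs]"):
    """Build the legacy SQL pivot query for the given artist ids."""
    # One conditional-sum column per feature, named a<aid>.
    cols = ",".join(
        "SUM(IF(aid={0},plays,NULL)) as a{0}".format(aid) for aid in aids
    )
    return "SELECT uid,{} FROM {} GROUP EACH BY uid".format(cols, source)


# Small demonstration with three features; the real run would use 1..2000.
query = build_pivot_query(range(1, 4))
print(query)
```

Unlike the generated example shown earlier, this sketch emits the columns in the order of the ids it is given.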

    Executing STEP 3 two more times (adjusting HAVING aid > NNNN and aid < NNNN), we get two more tables: mydataset.pivot_2001_4000 and mydataset.pivot_4001_6000.
    As you can see, mydataset.pivot_1_2000 has the expected schema, but only for features with aid from 1 to 2000; mydataset.pivot_2001_4000 has only features with aid from 2001 to 4000; and so on.
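
The chunk boundaries and the matching HAVING clauses can be derived mechanically. A small Python sketch, assuming 6000 features and a chunk size of 2000 as in this example (the helper name `chunk_ranges` is hypothetical):

```python
def chunk_ranges(n_features, chunk_size):
    """Split 1..n_features into inclusive (lo, hi) ranges of chunk_size."""
    ranges = []
    lo = 1
    while lo <= n_features:
        hi = min(lo + chunk_size - 1, n_features)
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges


for lo, hi in chunk_ranges(6000, 2000):
    # Exclusive bounds, matching the HAVING aid > N and aid < M form used above.
    having = "HAVING aid > {} and aid < {}".format(lo - 1, hi + 1)
    table = "mydataset.pivot_{}_{}".format(lo, hi)
    print(table, "<-", having)
```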

    STEP 4 – Merge all the partitioned pivot tables into a final pivot table with all features represented as columns in one table

    As in the steps above, we first need to generate a query and then run it. Initially we “stitch” mydataset.pivot_1_2000 and mydataset.pivot_2001_4000, then stitch the result with mydataset.pivot_4001_6000.

    SELECT 'SELECT x.uid uid,' + 
       GROUP_CONCAT_UNQUOTED(
          'a' + STRING(aid) 
       ) 
       + ' FROM [mydataset.pivot_1_2000] AS x
    JOIN EACH [mydataset.pivot_2001_4000] AS y ON y.uid = x.uid
    '
    FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 4001 ORDER BY aid)
    

    Run the output string from the query above and write the result to mydataset.pivot_1_4000.
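
As with step 3, the stitching query is just a string, so it could be built outside BigQuery. A minimal Python sketch, assuming the aid values are the consecutive numbers 1..max_aid produced in step 2; the function name `build_stitch_query` is hypothetical, and the sketch joins the parts with spaces rather than the line breaks used in the SQL literal.

```python
def build_stitch_query(left, right, max_aid):
    """Build the legacy SQL query that joins two pivot tables on uid."""
    # Column names are unique across the two tables, so no x./y. prefix is needed.
    cols = ",".join("a{}".format(aid) for aid in range(1, max_aid + 1))
    return (
        "SELECT x.uid uid,{} FROM [{}] AS x "
        "JOIN EACH [{}] AS y ON y.uid = x.uid"
    ).format(cols, left, right)


# Tiny demonstration with three columns; the real call would pass 4000.
stitch = build_stitch_query(
    "mydataset.pivot_1_2000", "mydataset.pivot_2001_4000", 3
)
print(stitch)
```

The inner join does not lose users here, because every step 3 query groups all of mydataset.aggs by uid, so each chunk table contains every uid.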

    Then we repeat STEP 4 as below:

    SELECT 'SELECT x.uid uid,' + 
       GROUP_CONCAT_UNQUOTED(
          'a' + STRING(aid) 
       ) 
       + ' FROM [mydataset.pivot_1_4000] AS x
    JOIN EACH [mydataset.pivot_4001_6000] AS y ON y.uid = x.uid
    '
    FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 6001 ORDER BY aid)
    

    Write the result to mydataset.pivot_1_6000.
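
With more than three chunks, the same left-to-right stitching simply repeats. The sequence of merges can be sketched in Python as follows; the function name `merge_plan` is hypothetical, and the table naming follows the mydataset.pivot_<lo>_<hi> convention used above.

```python
def merge_plan(ranges):
    """List (left, right, output) table names for stitching chunks in order."""
    steps = []
    lo, hi = ranges[0]
    left = "mydataset.pivot_{}_{}".format(lo, hi)
    for r_lo, r_hi in ranges[1:]:
        right = "mydataset.pivot_{}_{}".format(r_lo, r_hi)
        # The output covers everything from the first chunk up to this one.
        out = "mydataset.pivot_{}_{}".format(lo, r_hi)
        steps.append((left, right, out))
        left = out
    return steps


plan = merge_plan([(1, 2000), (2001, 4000), (4001, 6000)])
for left, right, out in plan:
    print(left, "+", right, "->", out)
```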

    The resulting table has the following schema:

    uid int, a1 int, a2 int, a3 int, . . . , a5999 int, a6000 int 
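
To illustrate the shape of that table, here is a toy in-memory equivalent of the SUM(IF(aid=N,plays,NULL)) pivot in Python. The input values are made up, and the function name `pivot` is hypothetical. A cell stays None (NULL) when the user never played that artist.

```python
def pivot(aggs, n_features):
    """Turn {(uid, aid): plays} into {uid: [a1, a2, ..., a<n_features>]}."""
    rows = {}
    for (uid, aid), plays in aggs.items():
        row = rows.setdefault(uid, [None] * n_features)
        # Sum plays into the column for this aid, treating None as "no rows yet".
        row[aid - 1] = (row[aid - 1] or 0) + plays
    return rows


# Made-up aggregates for two users and three artists.
sample_aggs = {(1, 1): 2, (1, 2): 1, (2, 2): 1}
table = pivot(sample_aggs, 3)
```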
    

    NOTE:
    a. I tried this approach only up to 6000 features, and it worked as expected.
    b. Run time for the second (main) queries in steps 3 and 4 varied from 20 to 60 minutes.
    c. IMPORTANT: the billing tier in steps 3 and 4 varied from 1 to 90. The good news is that the respective tables are relatively small (30-40MB), and so are the billed bytes. For “before 2016” projects everything is billed as tier 1, but after October 2016 this can be an issue.
    For more information, see Timing in High-Compute queries.
    d. The example above shows the power of large-scale data transformation with BigQuery! Still, I think (but I can be wrong) that storing a materialized feature matrix is not the best idea.
