Rolling 90 days active users in BigQuery, improving performance (DAU/MAU/WAU)

名媛妹妹 2020-12-01 15:51

I'm trying to get the number of unique events on a specific date, rolling 90/30/7 days back. I've got this working on a limited number of rows with the query below but fo

2 answers
  • 2020-12-01 15:54

    Counting unique users requires a lot of resources, even more if you want results over a rolling window. For a scalable solution, look into approximate algorithms like HLL++:

    • https://medium.freecodecamp.org/counting-uniques-faster-in-bigquery-with-hyperloglog-5d3764493a5a
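
    As a rough illustration of how HLL++ sketches behave in BigQuery, the following self-contained query builds one sketch per day and then merges them. The `events` CTE and its values are made up for this example. The merged estimate counts user 1 once, whereas summing the per-day estimates counts that user twice, which is why sketches (rather than daily counts) are what you want to combine over a rolling window.

    ```sql
    #standardSQL
    -- Hypothetical inline data: user 1 is active on both days.
    WITH events AS (
      SELECT DATE '2017-01-01' AS day, 1 AS user_id UNION ALL
      SELECT DATE '2017-01-01', 2 UNION ALL
      SELECT DATE '2017-01-02', 1 UNION ALL
      SELECT DATE '2017-01-02', 3
    ),
    daily AS (
      -- One HLL++ sketch per day.
      SELECT day, HLL_COUNT.INIT(user_id) AS sketch
      FROM events
      GROUP BY day
    )
    SELECT
      -- Merge the daily sketches and return the estimated distinct count.
      HLL_COUNT.MERGE(sketch) AS approx_users_both_days,
      -- Adding up per-day estimates double counts users active on several days.
      SUM(HLL_COUNT.EXTRACT(sketch)) AS sum_of_daily_estimates
    FROM daily
    ```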

    For an exact count, this would work (but gets slower as the window gets larger):

    #standardSQL
    SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
     , COUNT(DISTINCT owner_user_id) unique_90_day_users
     , COUNT(DISTINCT IF(i<31,owner_user_id,null)) unique_30_day_users
     , COUNT(DISTINCT IF(i<8,owner_user_id,null)) unique_7_day_users
    FROM (
      SELECT DATE(creation_date) date, owner_user_id
      FROM `bigquery-public-data.stackoverflow.posts_questions` 
      WHERE EXTRACT(YEAR FROM creation_date)=2017
      GROUP BY 1, 2
    ), UNNEST(GENERATE_ARRAY(1, 90)) i
    GROUP BY 1
    ORDER BY date_grp
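
    To see why the exact query slows down as the window grows, note that the join with `UNNEST(GENERATE_ARRAY(1, 90))` copies every (date, user) row once per date group it contributes to. A minimal sketch with a 3-day window and a single hypothetical row (the date and user id are made up):

    ```sql
    #standardSQL
    -- One input row fans out into one row per offset i.
    SELECT date, user_id, i, DATE_SUB(date, INTERVAL i DAY) AS date_grp
    FROM (
      SELECT DATE '2017-01-10' AS date, 42 AS user_id
    ), UNNEST(GENERATE_ARRAY(1, 3)) i
    ORDER BY i
    ```

    The single row for 2017-01-10 is assigned to the date groups 2017-01-09, 2017-01-08 and 2017-01-07. In other words, with `DATE_SUB` each `date_grp` collects activity from the days that follow it. If you would rather have windows that look back from `date_grp`, swapping in `DATE_ADD` is one option, though that changes which days each group covers.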
    

    The approximate solution is much faster (14s vs. 366s), at the cost of approximate results:

    #standardSQL
    SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
     , HLL_COUNT.MERGE(sketch) unique_90_day_users
     , HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
     , HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
    FROM (
      SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
      FROM `bigquery-public-data.stackoverflow.posts_questions` 
      WHERE EXTRACT(YEAR FROM creation_date)=2017
      GROUP BY 1
    ), UNNEST(GENERATE_ARRAY(1, 90)) i
    GROUP BY 1
    ORDER BY date_grp
    


    Updated query that gives correct results by removing rows whose window covers fewer than 90 days (this only works when no dates are missing):

    #standardSQL
    SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
     , HLL_COUNT.MERGE(sketch) unique_90_day_users
     , HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
     , HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
     , COUNT(*) window_days
    FROM (
      SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
      FROM `bigquery-public-data.stackoverflow.posts_questions` 
      WHERE EXTRACT(YEAR FROM creation_date)=2017
      GROUP BY 1
    ), UNNEST(GENERATE_ARRAY(1, 90)) i
    GROUP BY 1
    HAVING window_days=90
    ORDER BY date_grp
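
    If some dates are missing from the data, `COUNT(*)` falls below 90 even for windows that lie fully inside the data range. One possible workaround, sketched here under the assumption that the first and last dates present in the data bound the complete windows, is to filter on the date range instead of the row count. The CTE name `daily` is introduced only for this example.

    ```sql
    #standardSQL
    WITH daily AS (
      -- One sketch per day, as in the queries above.
      SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
      FROM `bigquery-public-data.stackoverflow.posts_questions`
      WHERE EXTRACT(YEAR FROM creation_date)=2017
      GROUP BY 1
    )
    SELECT *
    FROM (
      SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
       , HLL_COUNT.MERGE(sketch) unique_90_day_users
       , HLL_COUNT.MERGE(IF(i<31,sketch,null)) unique_30_day_users
       , HLL_COUNT.MERGE(IF(i<8,sketch,null)) unique_7_day_users
      FROM daily, UNNEST(GENERATE_ARRAY(1, 90)) i
      GROUP BY 1
    )
    -- date_grp covers the days date_grp+1 .. date_grp+90, so keep only the
    -- groups whose whole window lies between the first and last date in the data.
    WHERE date_grp BETWEEN (SELECT DATE_SUB(MIN(date), INTERVAL 1 DAY) FROM daily)
                       AND (SELECT DATE_SUB(MAX(date), INTERVAL 90 DAY) FROM daily)
    ORDER BY date_grp
    ```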
    
  • 2020-12-01 16:20

    You can aggregate per user first and then sum. Which aggregation? Take each user's most recent date:

    -- Assumes `date` is a DATE column in `consumer.events`.
    select count(*) as num_users,
           sum(case when max_date > date_sub(current_date(), interval 30 day) then 1 else 0 end) as num_users_30days,
           sum(case when max_date > date_sub(current_date(), interval 60 day) then 1 else 0 end) as num_users_60days,
           sum(case when max_date > date_sub(current_date(), interval 90 day) then 1 else 0 end) as num_users_90days
    from (select user_id, max(date) as max_date  -- most recent activity per user
          from `consumer.events` e
          group by user_id
         ) e;
    

    If the most recent date for the user is in the period, then the user should be counted.

    You can get this "as-of" a particular date by using a where clause in the subquery.
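
    A minimal sketch of the "as-of" variant, assuming `date` is a DATE column. The as-of date `2020-11-01` is an arbitrary example value, and it replaces the current date in the comparisons as well:

    ```sql
    select count(*) as num_users,
           sum(case when max_date > date_sub(date '2020-11-01', interval 30 day) then 1 else 0 end) as num_users_30days,
           sum(case when max_date > date_sub(date '2020-11-01', interval 60 day) then 1 else 0 end) as num_users_60days,
           sum(case when max_date > date_sub(date '2020-11-01', interval 90 day) then 1 else 0 end) as num_users_90days
    from (select user_id, max(e.date) as max_date
          from `consumer.events` e
          where e.date <= date '2020-11-01'  -- ignore activity after the as-of date
          group by user_id
         ) e;
    ```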
