Firebase exported to BigQuery: retention cohorts query

灰色年华 2021-02-02 00:52

Firebase offers split testing functionality through Firebase Remote Config, but it lacks the ability to filter retention in the cohorts section by user properties (wit

1 answer
  •  夕颜 (original poster)
    2021-02-02 01:10

    Any tips and directions on building a complex query that aggregates and calculates all the data required for this task in one step would be much appreciated.

    Yes, generic BigQuery will work fine.

    The query below is not the most generic version, but it can give you an idea.
    In this example I am using the Stack Overflow data available in Google BigQuery Public Datasets.

    The first sub-select, activities, is in most cases the only part you need to rewrite to reflect the specifics of your data.
    What it does:
    a. Defines the period you want to use for the analysis.
    In the example below it is a month: FORMAT_DATE('%Y-%m', ...
    But you can use year, week, day or anything else, respectively:
    • By year - FORMAT_DATE('%Y', DATE(answers.creation_date)) AS period
    • By week - FORMAT_DATE('%Y-%W', DATE(answers.creation_date)) AS period
    • By day - FORMAT_DATE('%Y-%m-%d', DATE(answers.creation_date)) AS period
    • …
    b. It also filters for only the type of events/activity you need to analyse.
    For example, `WHERE CONCAT('|', questions.tags, '|') LIKE '%|google-bigquery|%'` looks for answers to questions tagged google-bigquery.
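The period label is a zero-padded string, which is why MIN(period) in the cohorts sub-select can find a user's first period by plain string comparison. Here is a minimal Python sketch of the same bucketing, assuming Python's strftime codes behave like the FORMAT_DATE codes used here (%Y, %m, %d, %W). The helper name and the dates are hypothetical and only illustrate the idea.

```python
from datetime import date

# Hypothetical helper: mirrors FORMAT_DATE(fmt, DATE(...)) AS period.
PERIOD_FORMATS = {
    "year": "%Y",
    "month": "%Y-%m",
    "week": "%Y-%W",   # week of year, Monday as first day of the week
    "day": "%Y-%m-%d",
}

def period_of(day, granularity="month"):
    """Return the period label for a date at the chosen granularity."""
    return day.strftime(PERIOD_FORMATS[granularity])

d = date(2021, 2, 2)
labels = {g: period_of(d, g) for g in PERIOD_FORMATS}

# Zero padding keeps string order equal to chronological order,
# so the smallest label is the earliest period.
earliest = min(period_of(date(2015, 11, 3)), period_of(date(2015, 2, 20)))
```

Whatever granularity you choose, use the same format in the final WHERE filter on cohort, since that filter also compares strings.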

    The rest of the sub-queries are more or less generic and can mostly be used as is.

    #standardSQL
    WITH activities AS (
      -- one row per user and period with at least one matching activity
      SELECT answers.owner_user_id AS id,
        FORMAT_DATE('%Y-%m', DATE(answers.creation_date)) AS period
      FROM `bigquery-public-data.stackoverflow.posts_answers` AS answers
      JOIN `bigquery-public-data.stackoverflow.posts_questions` AS questions
      ON questions.id = answers.parent_id
      WHERE CONCAT('|', questions.tags, '|') LIKE '%|google-bigquery|%'
      GROUP BY id, period
    ), cohorts AS (
      -- cohort = first period in which the user was active
      SELECT id, MIN(period) AS cohort FROM activities GROUP BY id
    ), periods AS (
      -- sequential number for each period in which a cohort starts
      SELECT period, ROW_NUMBER() OVER(ORDER BY period) AS num
      FROM (SELECT DISTINCT cohort AS period FROM cohorts)
    ), cohorts_size AS (
      -- number of users in each cohort
      SELECT cohort, periods.num AS num, COUNT(DISTINCT activities.id) AS ids
      FROM cohorts JOIN activities ON activities.period = cohorts.cohort AND cohorts.id = activities.id
      JOIN periods ON periods.period = cohorts.cohort
      GROUP BY cohort, num
    ), retention AS (
      -- number of users from each cohort who were active in each period
      SELECT cohort, activities.period AS period, periods.num AS num, COUNT(DISTINCT cohorts.id) AS ids
      FROM periods JOIN activities ON activities.period = periods.period
      JOIN cohorts ON cohorts.id = activities.id
      GROUP BY cohort, period, num
    )
    SELECT
      CONCAT(cohorts_size.cohort, ' - ',  FORMAT("%'d", cohorts_size.ids), ' users') AS cohort,
      retention.num - cohorts_size.num AS period_lag,
      retention.period AS period_label,
      ROUND(retention.ids / cohorts_size.ids * 100, 2) AS retention,
      retention.ids AS rids
    FROM retention
    JOIN cohorts_size ON cohorts_size.cohort = retention.cohort
    WHERE cohorts_size.cohort >= FORMAT_DATE('%Y-%m', DATE('2015-01-01'))
    ORDER BY cohort, period_lag, period_label
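To make the roles of the sub-queries easier to follow, here is a rough Python sketch of the same steps on a handful of made-up rows. The user ids, the periods and the function name are hypothetical. It is meant to mirror the logic of the query above, not to replace it.

```python
from collections import defaultdict

# Hypothetical toy data: (user id, activity period as 'YYYY-MM').
events = [
    ("u1", "2015-01"), ("u1", "2015-02"), ("u1", "2015-03"),
    ("u2", "2015-01"), ("u2", "2015-03"),
    ("u3", "2015-02"), ("u3", "2015-03"),
    ("u4", "2015-03"),
]

def retention_rows(events):
    # activities: distinct (id, period) pairs.
    activities = set(events)
    # cohorts: first period in which each id was active.
    cohorts = {}
    for uid, period in activities:
        cohorts[uid] = min(period, cohorts.get(uid, period))
    # periods: number the distinct cohort periods in order.
    num = {p: i for i, p in enumerate(sorted(set(cohorts.values())), start=1)}
    # cohorts_size: ids belonging to each cohort.
    size = defaultdict(set)
    for uid, cohort in cohorts.items():
        size[cohort].add(uid)
    # retention: ids per (cohort, period). As with the inner join on
    # periods in the query, activity in a period where no cohort
    # starts is skipped.
    active = defaultdict(set)
    for uid, period in activities:
        if period in num:
            active[(cohorts[uid], period)].add(uid)
    # Final select: (cohort, period_lag, period_label, retention %, rids).
    return [
        (cohort, num[period] - num[cohort], period,
         round(len(ids) / len(size[cohort]) * 100, 2), len(ids))
        for (cohort, period), ids in sorted(active.items())
    ]

rows = retention_rows(events)
```

In this toy data the 2015-01 cohort has two users, one of whom is active in 2015-02, so its retention at period_lag 1 is 50%.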
    

    You can visualize the result of the above query with the tool of your choice.
    Note: you can use either period_lag or period_label as the column dimension.
    See the difference between them in the examples below.

    [Image: result visualized with period_lag]

    [Image: result visualized with period_label]
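The difference between the two columns can also be sketched without a charting tool. The snippet below pivots a few hypothetical output rows once by period_lag and once by period_label. The row values and the function name are made up for illustration.

```python
# Hypothetical rows shaped like the query output:
# (cohort, period_lag, period_label, retention).
rows = [
    ("2015-01", 0, "2015-01", 100.0),
    ("2015-01", 1, "2015-02", 50.0),
    ("2015-02", 0, "2015-02", 100.0),
]

def pivot(rows, column_index):
    """Build {cohort: {column value: retention}} from the flat rows."""
    table = {}
    for row in rows:
        table.setdefault(row[0], {})[row[column_index]] = row[3]
    return table

# Columns are offsets from the cohort's first period: every cohort starts at 0.
by_lag = pivot(rows, 1)
# Columns are calendar periods: later cohorts start in later columns.
by_label = pivot(rows, 2)
```

With period_lag every cohort row starts in the same column, which makes cohorts easy to compare. With period_label the columns follow the calendar, so later cohorts start further to the right.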
