Firebase exported to BigQuery: retention cohorts query

后端未结

关注

 1  1992

Firebase offer split testing functionality through Firebase remote configuration, but there are lack of ability to filter retention in cohorts sections with user properties (wit

相关标签:

1条回答

夕颜

2021-02-02 01:10

Any tips and directions to go about building complex query which may aggregate and calculate all data required for this task in one step is very appreciated.

yes, generic bigquery will work fine

Below is not the most generic version, but can give you an idea
In this example I am using Stack Overflow Data available in Google BigQuery Public Datasets

First sub-select – activities – in most cases the only what you need to re-write to reflect specifics of your data.
What it does is:
a. Defines period you want to set for analysis.
In example below - it is a month - FORMAT_DATE('%Y-%m', ...
But you can use year, week, day or anything else – respectively
• By year - FORMAT_DATE('%Y', DATE(answers.creation_date)) AS period
• By week - FORMAT_DATE('%Y-%W', DATE(answers.creation_date)) AS period
• By day - FORMAT_DATE('%Y-%m-%d', DATE(answers.creation_date)) AS period
• …
b. Also it “filters” only the type of events/activity you need to analyse
for example, `WHERE CONCAT('|', questions.tags, '|') LIKE '%|google-bigquery|%' looks for answers for google-bigquery tagged question

The rest of sub-queries are more-less generic and mostly can be used as is

#standardSQL
WITH activities AS (
  SELECT answers.owner_user_id AS id,
    FORMAT_DATE('%Y-%m', DATE(answers.creation_date)) AS period
  FROM `bigquery-public-data.stackoverflow.posts_answers` AS answers
  JOIN `bigquery-public-data.stackoverflow.posts_questions` AS questions
  ON questions.id = answers.parent_id
  WHERE CONCAT('|', questions.tags, '|') LIKE '%|google-bigquery|%' 
  GROUP BY id, period
), cohorts AS (
  SELECT id, MIN(period) AS cohort FROM activities GROUP BY id
), periods AS (
  SELECT period, ROW_NUMBER() OVER(ORDER BY period) AS num
  FROM (SELECT DISTINCT cohort AS period FROM cohorts)
), cohorts_size AS (
  SELECT cohort, periods.num AS num, COUNT(DISTINCT activities.id) AS ids 
  FROM cohorts JOIN activities ON activities.period = cohorts.cohort AND cohorts.id = activities.id
  JOIN periods ON periods.period = cohorts.cohort
  GROUP BY cohort, num
), retention AS (
  SELECT cohort, activities.period AS period, periods.num AS num, COUNT(DISTINCT cohorts.id) AS ids
  FROM periods JOIN activities ON activities.period = periods.period
  JOIN cohorts ON cohorts.id = activities.id 
  GROUP BY cohort, period, num 
)
SELECT 
  CONCAT(cohorts_size.cohort, ' - ',  FORMAT("%'d", cohorts_size.ids), ' users') AS cohort, 
  retention.num - cohorts_size.num AS period_lag, 
  retention.period as period_label,
  ROUND(retention.ids / cohorts_size.ids * 100, 2) AS retention , retention.ids AS rids
FROM retention
JOIN cohorts_size ON cohorts_size.cohort = retention.cohort
WHERE cohorts_size.cohort >= FORMAT_DATE('%Y-%m', DATE('2015-01-01'))
ORDER BY cohort, period_lag, period_label

You can visualize result of above query with the tool of your choice
Note: you can use either period_lag or period_label
See the difference of their use in below examples

with period_lag

with period_label

0 讨论(0)