Cohort analysis in SQL

慢半拍i 2020-12-09 13:12

Looking to do some cohort analysis on a userbase. We have 2 tables "users" and "sessions", where users and sessions both have a "created_at" field. I'm looking to f…

4 answers
  • 2020-12-09 13:41

    This answer inverts the output table that @Newy wanted so the cohorts are the rows instead of the columns, and uses absolute dates instead of relative ones.

    I was looking for a query that would give me something like this:

    Date        d0  d1  d2  d3  d4  d5  d6
    2016-11-03  3   1   0   0   0   0   0
    2016-11-04  4   2   0   1   0   0   *
    2016-11-05  7   0   1   1   0   *   *
    2016-11-06  7   3   1   1   *   *   *
    2016-11-07  13  5   1   *   *   *   *
    2016-11-08  4   0   *   *   *   *   *
    2016-11-09  1   *   *   *   *   *   *
    

    I was looking for the number of users that signed up on a certain date, then how many of those users returned 1 day later, 2 days later, etc. So on 2016-11-07, 13 users signed up and had a session, then 5 of those users came back 1 day later, then one user came back 2 days later, etc.

    I took the first subquery of @Andriy M's large query and modified it to give me the date a user signed up, not the days relative to the current date:

    SELECT
        id,
        DATE(created_at) AS DayOffset
      FROM users
      WHERE created_at >= CURDATE() - INTERVAL 6 DAY
    

    Then the LEFT JOIN subquery I modified to look like this:

    SELECT DISTINCT
        sessions.user_id,
        DATEDIFF(sessions.created_at, users.created_at) AS DayOffset
      FROM sessions
      LEFT JOIN users ON (users.id = sessions.user_id)
      WHERE sessions.created_at >= CURDATE() - INTERVAL 6 DAY
    

    I wanted the day offset to be relative to the date the user signed up, not relative to the current date as in @Andriy M's answer. So I left-joined the users table to get the time each user signed up and took the date difference against it.

    So the final query looks something like this:

    SELECT u.DayOffset AS Date,
      SUM(s.DayOffset = 0) AS d0,
      SUM(s.DayOffset = 1) AS d1,
      SUM(s.DayOffset = 2) AS d2,
      SUM(s.DayOffset = 3) AS d3,
      SUM(s.DayOffset = 4) AS d4,
      SUM(s.DayOffset = 5) AS d5,
      SUM(s.DayOffset = 6) AS d6
    FROM (
      SELECT
        id,
        DATE(created_at) AS DayOffset
      FROM users
      WHERE created_at >= CURDATE() - INTERVAL 6 DAY
    ) AS u
    LEFT JOIN (
      SELECT DISTINCT
        sessions.user_id,
        DATEDIFF(sessions.created_at, users.created_at) AS DayOffset
      FROM sessions
      LEFT JOIN users ON (users.id = sessions.user_id)
      WHERE sessions.created_at >= CURDATE() - INTERVAL 6 DAY
    ) AS s
    ON s.user_id = u.id
    GROUP BY u.DayOffset
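
    To see the shape of the result on a few rows, here is a minimal, self-contained sketch in Python that runs a similar query against an in-memory SQLite database instead of MySQL. Everything in it is an assumption made for illustration: the sample rows are invented, the CURDATE() filter is dropped so the result does not depend on the day it runs, the alias signup_date replaces DayOffset for the cohort date, an inner join replaces the LEFT JOIN in the sessions subquery, and the difference of two julianday() values stands in for MySQL's DATEDIFF().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented sample data: two users sign up on 2016-11-07, one on 2016-11-08.
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, created_at TEXT);
CREATE TABLE sessions (user_id INTEGER, created_at TEXT);
INSERT INTO users VALUES
  (1, '2016-11-07 09:00:00'),
  (2, '2016-11-07 18:30:00'),
  (3, '2016-11-08 12:00:00');
INSERT INTO sessions VALUES
  (1, '2016-11-07 09:05:00'),
  (1, '2016-11-07 21:00:00'),
  (1, '2016-11-08 08:00:00'),
  (2, '2016-11-07 18:31:00'),
  (2, '2016-11-09 10:00:00'),
  (3, '2016-11-08 12:01:00');
""")

QUERY = """
SELECT u.signup_date,
       SUM(s.day_offset = 0) AS d0,
       SUM(s.day_offset = 1) AS d1,
       SUM(s.day_offset = 2) AS d2
FROM (
    SELECT id, date(created_at) AS signup_date
    FROM users
) AS u
LEFT JOIN (
    -- One row per user and per day offset since signup.
    -- julianday() difference is the SQLite stand-in for DATEDIFF().
    SELECT DISTINCT
        sessions.user_id,
        CAST(julianday(date(sessions.created_at))
             - julianday(date(users.created_at)) AS INTEGER) AS day_offset
    FROM sessions
    JOIN users ON users.id = sessions.user_id
) AS s ON s.user_id = u.id
GROUP BY u.signup_date
ORDER BY u.signup_date
"""

cohort_rows = conn.execute(QUERY).fetchall()
for row in cohort_rows:
    print(row)
```

    With this sample data the two rows are ('2016-11-07', 2, 1, 1) and ('2016-11-08', 1, 0, 0): both users from 2016-11-07 had a session on day 0, one returned on day 1 and one on day 2.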
    
  • 2020-12-09 13:47

    Monthly cohort based on @Newy response:

    SELECT u.MonthOffset AS MONTH,
      SUM(s.MonthOffset = 0) AS m0,
      SUM(s.MonthOffset = 1) AS m1,
      SUM(s.MonthOffset = 2) AS m2,
      SUM(s.MonthOffset = 3) AS m3,
      SUM(s.MonthOffset = 4) AS m4,
      SUM(s.MonthOffset = 5) AS m5,
      SUM(s.MonthOffset = 6) AS m6
    FROM (
     SELECT
        id,
        TIMESTAMPDIFF(month, DATE(date), CURDATE()) AS MonthOffset
      FROM users
      WHERE date >= CURDATE() - INTERVAL 6 month
    ) AS u
    LEFT JOIN (
        SELECT DISTINCT
        user_id,
        TIMESTAMPDIFF(month, DATE(date), CURDATE()) AS MonthOffset
        FROM sessions
        WHERE sessions.date >= CURDATE() - INTERVAL 6 month
    ) AS s
    ON s.user_id = u.id
    GROUP BY u.MonthOffset;  
    
  • 2020-12-09 13:52

    This looks like a complex problem. Whether or not it also seems difficult to you, it is never a bad idea to build up to it from a smaller problem.

    You could start, for instance, with a query returning all the users (just the users) who registered within the last week, i.e. starting from six days ago, as per your requirement:

    SELECT *
    FROM users
    WHERE created_at >= CURDATE() - INTERVAL 6 DAY
    

    The next step could be grouping the results by dates and counting rows in every group:

    SELECT
      created_at,
      COUNT(*) AS user_count
    FROM users
    WHERE created_at >= CURDATE() - INTERVAL 6 DAY
    GROUP BY created_at
    

    If created_at is a datetime or timestamp, use DATE(created_at) as the grouping criterion:

    SELECT
      DATE(created_at) AS created_at,
      COUNT(*) AS user_count
    FROM users
    WHERE created_at >= CURDATE() - INTERVAL 6 DAY
    GROUP BY DATE(created_at)
    

    However, you don't seem to want absolute dates in the output, but only relative ones, like today, today - 1 day etc. In that case, you could use the DATEDIFF() function, which returns the number of days between two dates, to produce (numeric) offsets from today and group by those values:

    SELECT
      DATEDIFF(CURDATE(), created_at) AS created_at,
      COUNT(*) AS user_count
    FROM users
    WHERE created_at >= CURDATE() - INTERVAL 6 DAY
    GROUP BY DATEDIFF(CURDATE(), created_at)
    

    Your created_at column would contain "dates" like 0, 1 and so on till 6. Converting them into today, today-1 etc. is trivial and you will see that in the final query. So far, however, we've reached the point at which we need to take one step back (or, perhaps, it's rather a half step to the right), because we don't really need to count the users but rather their returns. So, the actual working dataset from users that is needed at the moment will be this:

    SELECT
      id,
      DATEDIFF(CURDATE(), created_at) AS day_offset
    FROM users
    WHERE created_at >= CURDATE() - INTERVAL 6 DAY
    

    We need user IDs to join this rowset to (the one that will be derived from) sessions and we need day_offset as the grouping criterion.

    Moving on, a similar transformation will need to be performed on the sessions table, and I won't go into details on that. Suffice it to say that the resulting query will be nearly identical to the last one, with just two exceptions:

    • id gets replaced with user_id;

    • DISTINCT is applied to the entire subset.

    The reason for DISTINCT is to return no more than one row per user & day: it is my understanding that however many sessions a user might have on a particular day, you want to count them as one return. So, here's what gets derived from sessions:

    SELECT DISTINCT
      user_id,
      DATEDIFF(CURDATE(), created_at) AS day_offset
    FROM sessions
    WHERE created_at >= CURDATE() - INTERVAL 6 DAY
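
    The effect of DISTINCT can be checked on a handful of rows. The following is a minimal sketch in Python against an in-memory SQLite database, not the MySQL query itself: the sample sessions are invented, the constant TODAY is a fixed stand-in for CURDATE() so the result is reproducible, the six-day filter is omitted, and the julianday() difference stands in for DATEDIFF().

```python
import sqlite3

TODAY = "2016-11-09"  # assumed fixed "today", replacing CURDATE()

conn = sqlite3.connect(":memory:")
# Invented data: user 1 has three sessions today and one yesterday.
conn.executescript("""
CREATE TABLE sessions (user_id INTEGER, created_at TEXT);
INSERT INTO sessions VALUES
  (1, '2016-11-09 08:00:00'),
  (1, '2016-11-09 12:00:00'),
  (1, '2016-11-09 20:00:00'),
  (1, '2016-11-08 10:00:00'),
  (2, '2016-11-08 11:00:00');
""")

# The same query is run twice, with and without the DISTINCT keyword.
OFFSETS = """
SELECT {distinct} user_id,
       CAST(julianday(:today) - julianday(date(created_at)) AS INTEGER) AS day_offset
FROM sessions
ORDER BY user_id, day_offset
"""

all_rows = conn.execute(OFFSETS.format(distinct=""), {"today": TODAY}).fetchall()
distinct_rows = conn.execute(OFFSETS.format(distinct="DISTINCT"), {"today": TODAY}).fetchall()

print(len(all_rows), "rows without DISTINCT")
print(distinct_rows)
```

    Without DISTINCT there are five rows, and user 1 would be counted three times for day 0. With DISTINCT the rows collapse to (1, 0), (1, 1) and (2, 1), i.e. one return per user and day.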
    

    Now it only remains to join the two derived tables, apply grouping and use conditional aggregation to get the required results:

    SELECT
      CONCAT('today', IFNULL(CONCAT('-', NULLIF(u.DayOffset, 0)), '')) AS created_at,
      SUM(s.DayOffset = 0) AS d0,
      SUM(s.DayOffset = 1) AS d1,
      SUM(s.DayOffset = 2) AS d2,
      SUM(s.DayOffset = 3) AS d3,
      SUM(s.DayOffset = 4) AS d4,
      SUM(s.DayOffset = 5) AS d5,
      SUM(s.DayOffset = 6) AS d6
    FROM (
      SELECT
        id,
        DATEDIFF(CURDATE(), created_at) AS DayOffset
      FROM users
      WHERE created_at >= CURDATE() - INTERVAL 6 DAY
    ) u
    LEFT JOIN (
      SELECT DISTINCT
        user_id,
        DATEDIFF(CURDATE(), created_at) AS DayOffset
      FROM sessions
      WHERE created_at >= CURDATE() - INTERVAL 6 DAY
    ) s
    ON u.id = s.user_id
    GROUP BY u.DayOffset;
    

    I must admit that I haven't tested or debugged this, but if that is needed, I'll be happy to work with sample data once you have provided it. :)

  • 2020-12-09 13:59

    Example of a month-wise cohort:

    First, let's build a table of each individual user's activity flow (month-wise):

    -- `order` is a reserved word in MySQL, so the table name must be quoted with backticks
    SELECT 
        mu.created_timestamp AS cohort
        , mu.id AS user_id
        ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 1 AND l.user_id = mu.id) AS m1
        ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 2 AND l.user_id = mu.id) AS m2
        ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 3 AND l.user_id = mu.id) AS m3
        ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 4 AND l.user_id = mu.id) AS m4
        ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 5 AND l.user_id = mu.id) AS m5
        ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 6 AND l.user_id = mu.id) AS m6
        ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 7 AND l.user_id = mu.id) AS m7
        ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 8 AND l.user_id = mu.id) AS m8
        ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 9 AND l.user_id = mu.id) AS m9
        ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 10 AND l.user_id = mu.id) AS m10
        ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 11 AND l.user_id = mu.id) AS m11
        ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 12 AND l.user_id = mu.id) AS m12
    FROM user mu 
    WHERE mu.created_timestamp BETWEEN '2018-01-01 00:00:00' AND '2019-12-31 23:59:59'
    

    Then, on top of that table, sum the individual users' activity per cohort:

    SELECT MONTH(c.cohort) AS cohort
           ,COUNT(c.user_id) AS signups
           ,SUM(c.m1) AS m1 
           ,SUM(c.m2) AS m2 
           ,SUM(c.m3) AS m3 
           ,SUM(c.m4) AS m4 
           ,SUM(c.m5) AS m5 
           ,SUM(c.m6) AS m6 
           ,SUM(c.m7) AS m7 
           ,SUM(c.m8) AS m8 
           ,SUM(c.m9) AS m9 
           ,SUM(c.m10) AS m10 
           ,SUM(c.m11) AS m11 
           ,SUM(c.m12) AS m12 
    FROM (
        SELECT 
            mu.created_timestamp AS cohort
            , mu.id AS user_id
            ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 1 AND l.user_id = mu.id) AS m1
            ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 2 AND l.user_id = mu.id) AS m2
            ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 3 AND l.user_id = mu.id) AS m3
            ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 4 AND l.user_id = mu.id) AS m4
            ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 5 AND l.user_id = mu.id) AS m5
            ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 6 AND l.user_id = mu.id) AS m6
            ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 7 AND l.user_id = mu.id) AS m7
            ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 8 AND l.user_id = mu.id) AS m8
            ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 9 AND l.user_id = mu.id) AS m9
            ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 10 AND l.user_id = mu.id) AS m10
            ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 11 AND l.user_id = mu.id) AS m11
            ,(SELECT IF(COUNT(l.order_date) = 0 , 0, 1) FROM `order` l WHERE MONTH(l.order_date) = 12 AND l.user_id = mu.id) AS m12
        FROM user mu 
        WHERE mu.created_timestamp BETWEEN '2018-01-01 00:00:00' AND '2019-12-31 23:59:59'
    ) AS c
    GROUP BY MONTH(c.cohort)
    

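    You can use days in place of months, although cohort analysis is mostly done by month. Note that MONTH() ignores the year, so with a date range longer than one year, as in the filter above, the same calendar month of different years falls into a single cohort.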
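
    As a variation on the queries above, the per-user flags can also be produced with one LEFT JOIN and MAX(CASE ...) instead of one correlated subquery per month. The following is a minimal sketch of that swapped-in technique in Python with an in-memory SQLite database, not the original MySQL code: the sample rows are invented, only three month columns are shown, strftime() stands in for MONTH(), and the cohort label includes the year, which is an assumption about what is wanted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented data: two signups in January 2018, one in January 2019.
conn.executescript("""
CREATE TABLE user (id INTEGER PRIMARY KEY, created_timestamp TEXT);
CREATE TABLE "order" (user_id INTEGER, order_date TEXT);
INSERT INTO user VALUES
  (1, '2018-01-15 10:00:00'),
  (2, '2018-01-20 11:00:00'),
  (3, '2019-01-05 09:00:00');
INSERT INTO "order" VALUES
  (1, '2018-01-16'), (1, '2018-01-30'), (1, '2018-02-02'),
  (2, '2018-01-21'),
  (3, '2019-01-06'), (3, '2019-03-01');
""")

QUERY = """
SELECT strftime('%Y-%m', c.created_timestamp) AS cohort,
       COUNT(*) AS signups,
       SUM(c.m1) AS m1,
       SUM(c.m2) AS m2,
       SUM(c.m3) AS m3
FROM (
    -- One row per user; each flag is 1 if the user ordered in that calendar month.
    SELECT mu.id,
           mu.created_timestamp,
           MAX(CASE WHEN strftime('%m', l.order_date) = '01' THEN 1 ELSE 0 END) AS m1,
           MAX(CASE WHEN strftime('%m', l.order_date) = '02' THEN 1 ELSE 0 END) AS m2,
           MAX(CASE WHEN strftime('%m', l.order_date) = '03' THEN 1 ELSE 0 END) AS m3
    FROM user mu
    LEFT JOIN "order" l ON l.user_id = mu.id
    GROUP BY mu.id, mu.created_timestamp
) AS c
GROUP BY cohort
ORDER BY cohort
"""

monthly_rows = conn.execute(QUERY).fetchall()
for row in monthly_rows:
    print(row)
```

    With this sample data the result is ('2018-01', 2, 2, 1, 0) and ('2019-01', 1, 1, 0, 1). Grouping by the month number alone would merge the two January cohorts into one row with three signups.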
