Querying count on daily basis with date constraints over multiple weeks

后端 未结 1 578
粉色の甜心
粉色の甜心 2021-01-03 10:57

I\'m trying to find the # active users over time on a daily basis.

A user is active when he has made more than 10 requests per week for 4 co

1条回答
  •  花落未央
    2021-01-03 11:57

    Here is a long answer how to make your queries short. :)

    Table

    Building on my table (before you provided table definition with different (odd!) data types:

    CREATE TABLE requests (
       id           int
     , accounts_id  int  -- (id of the user)
     , recipient_id int  -- (id of the recipient)
     , date         date -- (date that the request was made in YYYY-MM-DD)
     , amount       int  -- (# of requests by accounts_id for the day)
    );
    

    Active Users for given day

    The list of "active users" for one given day:

    SELECT accounts_id
    FROM  (
       SELECT w.w, r.accounts_id
       FROM  (
          SELECT w
               , day - 6 - 7 * w AS w_start
               , day     - 7 * w AS w_end   
          FROM  (SELECT '2014-10-31'::date - 1 AS day) d  -- effective date here
               , generate_series(0,3) w
          ) w
       JOIN   requests r ON r."date" BETWEEN w_start AND w_end
       GROUP  BY w.w, r.accounts_id
       HAVING sum(r.amount) > 10
       ) sub
    GROUP  BY 1
    HAVING count(*) = 4;
    

    Step 1

    In the innermost subquery w (for "week") build the bounds of the 4 weeks of interest from a CROSS JOIN of the given day - 1 with the output of generate_series(0-3).

    To add / subtract days to / from a date (not from a timestamp!) just add / subtract integer numbers. The expression day - 7 * w subtracts 0-3 times 7 days from the given date, arriving at the end dates for each week (w_end).
    Subrtract another 6 days (not 7!) from each to compute the respective start (w_start).
    Additionally, keep the week number w (0-3) for the later aggregation.

    Step 2

    In subquery sub join rows from requests to the set of 4 weeks, where the date lies between start and end date. GROUP BY the week number w and the accounts_id.
    Only weeks with more than 10 requests total qualify.

    Step 3

    In the outer SELECT count the number of weeks each user (accounts_id) qualified. Must be 4 to qualify as "active user"

    Count of active users per day

    This is dynamite.
    Wrapped in in a simple SQL function to simplify general use, but the query can be used on its own just as well:

    CREATE FUNCTION f_active_users (_now date = now()::date, _days int = 3)
      RETURNS TABLE (day date, users int) AS
    $func$
    WITH r AS (
       SELECT accounts_id, date, sum(amount)::int AS amount
       FROM   requests
       WHERE  date BETWEEN _now - (27 + _days) AND _now - 1
       GROUP  BY accounts_id, date
       )
    SELECT date + 1, count(w_ct = 4 OR NULL)::int
    FROM  (
       SELECT accounts_id, date
            , count(w_amount > 10 OR NULL)
                             OVER (PARTITION BY accounts_id, dow ORDER BY date DESC
                             ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) AS w_ct
       FROM  (
          SELECT accounts_id, date, dow   
               , sum(amount) OVER (PARTITION BY accounts_id ORDER BY date DESC
                             ROWS BETWEEN CURRENT ROW AND 6 FOLLOWING) AS w_amount
          FROM  (SELECT _now - i AS date, i%7 AS dow
                 FROM   generate_series(1, 27 + _days) i) d -- period of interest
          CROSS  JOIN (
                 SELECT accounts_id FROM r
                 GROUP  BY 1
                 HAVING count(*) > 3 AND sum(amount) > 39  -- enough rows & requests
                 AND    max(date) > min(date) + 15) a      -- can cover 4 weeks
          LEFT   JOIN r USING (accounts_id, date)
          ) sub1
       WHERE date > _now - (22 + _days)  -- cut off 6 trailing days now - useful?
       ) sub2
    GROUP  BY date
    ORDER  BY date DESC
    LIMIT  _days
    $func$ LANGUAGE sql STABLE;
    

    The function takes any day (_now), "today" by default, and the number of days (_days) in the result, 3 by default. Call:

    SELECT * FROM f_active_users('2014-10-31', 5);
    

    Or without parameters to use defaults:

    SELECT * FROM f_active_users();
    

    The approach is different from the first query.

    SQL Fiddle with both queries and variants for your table definition.

    Step 0

    In the CTE r pre-aggregate amounts per (accounts_id, date) for only the period of interest, for better performance. The table is only scanned once, the suggested index (see blow) will kick in here.

    Step 1

    In the inner subquery d generate the necessary list of days: 27 + _days rows, where _days is the desired number of rows in the output, effectively 28 days or more.
    While being at it, compute the day of the week (dow) to be used for aggregating in step 3. i%7 coincides with weekly intervals, the query works for any interval, though.

    In the inner subquery a generate a unique list of users (accounts_id) that exist in CTE r and pass some first superficial tests (sufficient rows spanning sufficient time with sufficient total requests).

    Step 2

    Generate a Cartesian product from d and a with a CROSS JOIN to have one row for every relevant day for every relevant user. LEFT JOIN to r to append the amount of requests (if any). No WHERE condition, we want every day in the result, even if there are no active users at all.

    Compute the total amount for the past week (w_amount) in the same step using a Window functions with a custom frame. Example:

    • How to use a ring data structure in window functions

    Step 3

    Cut off the last 6 days now; which is optional and may or may not help performance. Test it: WHERE date >= _now - (21 + _days)

    Count the weeks where the minimum amount is met (w_ct) in a similar window function, this time partitioned by dow additionally to only have same weekdays for the past 4 weeks in the frame (which carry the sum of the respective past week). The expression count(w_amount > 10 OR NULL) only counts rows with more than 10 requests. Detailed explanation:

    • Compute percents from SUM() in the same SELECT sql query

    Step 4

    In the outer SELECT group by date and count users that passed all 4 weeks (count(w_ct = 4 OR NULL)). Add 1 to the date to compensate off-by-1, ORDER and LIMIT to the requested number of days.

    Performance and outlook

    The perfect index for both queries would be:

    CREATE INDEX foo ON requests (date, accounts_id, amount);
    

    Performance should be good, but get even (much) better with the upcoming Postgres 9.4, due to the new moving aggregate support:

    Moving-aggregate support in the Postgres Wiki.
    Moving aggregates in the 9.4 manual

    Aside: don't call a timestamp column "date", it's a timestamp, not a date. Better yet, never use basic type names like date or timestamp as identifier. Ever.

    0 讨论(0)
提交回复
热议问题