Calculating follower growth over time for each influencer

Question

I have a table with influencers and their follower count for each day:

influencer_id |     date     |    followers
     1        | 2020-05-29   |      7361
     1        | 2020-05-28   |      7234
                    ...
     2        | 2020-05-29   |       82
     2        | 2020-05-28   |       85
                    ...
     3        | 2020-05-29   |      3434
     3        | 2020-05-28   |      2988
     3        | 2020-05-27   |      2765
                    ...

Let's say I want to calculate how many followers each individual influencer has gained in the last 7 days and get the following table:

influencer_id |                       growth
     1        |  <num followers last day - num followers first day>
     2        |                         "
     3        |                         "

As a first attempt I did this:

SELECT influencer_id,
      (MAX(followers) - MIN(followers)) AS growth
FROM influencer_follower_daily
WHERE date < '2020-05-30'
AND date >= '2020-05-23'
GROUP BY influencer_id;

This works and shows the growth over the week for each influencer. But it assumes the follower count always increases and people never unfollow!
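To make the failure concrete, here is a minimal sketch in Python. It uses the standard-library sqlite3 module as a stand-in for Postgres (an assumption made only so the example is self-contained) and loads just the rows shown above for influencers 1 and 2; the elided days are left out.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the Postgres table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE influencer_follower_daily ("
    "influencer_id INTEGER, date TEXT, followers INTEGER)"
)
# Only the rows shown in the question for influencers 1 and 2.
conn.executemany(
    "INSERT INTO influencer_follower_daily VALUES (?, ?, ?)",
    [
        (1, "2020-05-28", 7234), (1, "2020-05-29", 7361),
        (2, "2020-05-28", 85), (2, "2020-05-29", 82),
    ],
)

# The first-attempt query: MAX - MIN per influencer.
max_min = dict(conn.execute(
    """
    SELECT influencer_id, MAX(followers) - MIN(followers) AS growth
    FROM influencer_follower_daily
    WHERE date >= '2020-05-23' AND date < '2020-05-30'
    GROUP BY influencer_id
    """
).fetchall())

print(max_min)
```

Influencer 2 went from 85 to 82 followers, a change of -3, yet MAX - MIN reports 3 because it ignores which value came first.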

So is there a way to achieve what I want using an SQL query over the original table? Or will I have to generate a completely new table using a FOR loop that calculates a +/- follower change column between each date?

Answer 1:

The simple aggregate functions first() and last() are not implemented in standard Postgres. But see below.

1. `array_agg()`

Gordon demonstrated a query with array_agg(), but that's more expensive than necessary, especially with many rows per group. Even more so when called twice, and with ORDER BY per aggregate. This equivalent alternative should be substantially faster:

SELECT influencer_id, arr[array_upper(arr, 1)] - arr[1]
FROM  (
   SELECT influencer_id, array_agg(followers) AS arr
   FROM  (
      SELECT influencer_id, followers
      FROM   influencer_follower_daily
      WHERE  date >= '2020-05-23'
      AND    date <  '2020-05-30'
      ORDER  BY influencer_id, date
      ) sub1
   GROUP  BY influencer_id
   ) sub2;

This is faster because it sorts once and aggregates once. The sort order of the inner subquery sub1 is carried over to the next level. See:

How to apply ORDER BY and LIMIT in combination with an aggregate function?
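The "sort once, aggregate once, take last minus first" idea can be sketched outside the database as well. The following Python snippet is only an illustration of the logic, not of how Postgres executes the query; the row list is the sample data from the question with the elided days omitted.

```python
from itertools import groupby
from operator import itemgetter

# (influencer_id, date, followers): the rows shown in the question.
rows = [
    (1, "2020-05-29", 7361), (1, "2020-05-28", 7234),
    (2, "2020-05-29", 82), (2, "2020-05-28", 85),
    (3, "2020-05-29", 3434), (3, "2020-05-28", 2988), (3, "2020-05-27", 2765),
]

# One sort by (influencer_id, date), like ORDER BY in the subquery sub1.
# ISO-formatted date strings sort chronologically.
rows.sort(key=itemgetter(0, 1))

growth = {}
for influencer_id, group in groupby(rows, key=itemgetter(0)):
    # Like array_agg(followers): values arrive in date order.
    arr = [followers for _, _, followers in group]
    # Like arr[array_upper(arr, 1)] - arr[1]: last element minus first.
    growth[influencer_id] = arr[-1] - arr[0]

print(growth)
```

Because the values are collected in date order, a shrinking follower count yields a negative result, as it does for influencer 2 here.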

Indexes matter:

If you query the whole table or most of it, an index on (influencer_id, date, followers) can help (a lot) with index-only scans.
If you query only a small fragment of the table, an index on (date) or (date, influencer_id, followers) can help (a lot).

2. `DISTINCT` & window functions

Gordon also demonstrated DISTINCT with window functions. Again, this alternative can be substantially faster:

SELECT DISTINCT ON (influencer_id)
       influencer_id
     , last_value(followers) OVER (PARTITION BY influencer_id ORDER BY date
                                   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
     - followers AS growth
FROM   influencer_follower_daily
WHERE  date >= '2020-05-23'
AND    date <  '2020-05-30'
ORDER  BY influencer_id, date;

This uses a single window function with the same sort order (!) as the main query. To achieve this, we need the non-default window definition with ROWS BETWEEN ... See:

PostgreSQL query with max and min date plus associated id per row

It also uses DISTINCT ON instead of DISTINCT. See:

Select first row in each GROUP BY group?
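Why the non-default frame matters can be checked with a small experiment. The sketch below uses Python's standard-library sqlite3 module as a stand-in for Postgres (an assumption; it needs SQLite 3.25 or later for window functions). The table name t is hypothetical, and the rows are those shown for influencer 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table name "t"; rows are those of influencer 3 from the question.
conn.execute("CREATE TABLE t (influencer_id INTEGER, date TEXT, followers INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(3, "2020-05-27", 2765), (3, "2020-05-28", 2988), (3, "2020-05-29", 3434)],
)

# Default frame: ends at the current row, so last_value() just echoes each row.
default_frame = [r[0] for r in conn.execute(
    """
    SELECT last_value(followers) OVER (PARTITION BY influencer_id ORDER BY date)
    FROM t
    ORDER BY date
    """
)]

# Explicit frame spanning the whole partition: last_value() sees the final row.
full_frame = [r[0] for r in conn.execute(
    """
    SELECT last_value(followers) OVER (
               PARTITION BY influencer_id ORDER BY date
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    FROM t
    ORDER BY date
    """
)]

print(default_frame)
print(full_frame)
```

With the default frame, the first row of each partition would compute followers minus itself. Only the full frame gives every row access to the final day's count.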

3. Custom aggregate functions

`first()` and `last()`

You can add those yourself; it's pretty simple. See instructions in the Postgres Wiki.
Or install the additional module first_last_agg with a faster implementation in C.

Related:

Use something like TOP with GROUP BY

Then your query becomes simpler:

SELECT influencer_id, last(followers) - first(followers) AS growth
FROM  (
   SELECT influencer_id, followers
   FROM   influencer_follower_daily 
   WHERE  date >= '2020-05-23'
   AND    date <  '2020-05-30'
   ORDER  BY influencer_id, date
   ) z
GROUP  BY influencer_id
ORDER  BY influencer_id;

Custom aggregate `growth()`

You can combine first() and last() in a single aggregate function. That's faster, but calling two C functions will still outperform one custom SQL function.

This basically encapsulates the logic of my first query in a custom aggregate:

CREATE OR REPLACE FUNCTION f_growth(anyarray)
  RETURNS anyelement LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE AS
'SELECT $1[array_upper($1, 1)] - $1[1]';

CREATE OR REPLACE AGGREGATE growth(anyelement) (
   SFUNC     = array_append
 , STYPE     = anyarray
 , FINALFUNC = f_growth
 , PARALLEL  = SAFE
);

This works for any numeric type (or any type with an operator type - type returning the same type). The query is simpler still:

SELECT influencer_id, growth(followers)
FROM  (
   SELECT influencer_id, followers
   FROM   influencer_follower_daily 
   WHERE  date >= '2020-05-23'
   AND    date <  '2020-05-30'
   ORDER  BY influencer_id, date
   ) z
GROUP  BY influencer_id
ORDER  BY influencer_id;

Or a little slower, but ultimately short:

SELECT influencer_id, growth(followers ORDER BY date)
FROM   influencer_follower_daily 
WHERE  date >= '2020-05-23'
AND    date <  '2020-05-30'
GROUP  BY 1
ORDER  BY 1;
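For experimenting with the idea of a single custom aggregate outside Postgres, here is a rough sketch using Python's standard-library sqlite3 module. It is not the growth() aggregate defined above: the class Growth is hypothetical and takes the date as an extra argument so that it tracks the earliest and latest rows itself, instead of relying on input order.

```python
import sqlite3


class Growth:
    """Hypothetical aggregate: followers on the latest date minus the earliest."""

    def __init__(self):
        self.first = None  # (date, followers) with the smallest date seen
        self.last = None   # (date, followers) with the largest date seen

    def step(self, date, followers):
        if date is None or followers is None:
            return  # skip NULL input, similar in spirit to a STRICT function
        if self.first is None or date < self.first[0]:
            self.first = (date, followers)
        if self.last is None or date > self.last[0]:
            self.last = (date, followers)

    def finalize(self):
        if self.first is None:
            return None
        return self.last[1] - self.first[1]


conn = sqlite3.connect(":memory:")
conn.create_aggregate("growth", 2, Growth)
conn.execute(
    "CREATE TABLE influencer_follower_daily ("
    "influencer_id INTEGER, date TEXT, followers INTEGER)"
)
# The rows shown in the question; the elided days are left out.
conn.executemany(
    "INSERT INTO influencer_follower_daily VALUES (?, ?, ?)",
    [
        (1, "2020-05-29", 7361), (1, "2020-05-28", 7234),
        (2, "2020-05-29", 82), (2, "2020-05-28", 85),
        (3, "2020-05-29", 3434), (3, "2020-05-28", 2988), (3, "2020-05-27", 2765),
    ],
)

result = dict(conn.execute(
    """
    SELECT influencer_id, growth(date, followers)
    FROM influencer_follower_daily
    WHERE date >= '2020-05-23' AND date < '2020-05-30'
    GROUP BY influencer_id
    """
).fetchall())

print(result)
```

Passing the date into the aggregate trades a little extra state for independence from row order, so no sorted subquery is needed in this variant.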


4. Performance optimization for many rows per group

With many rows per group / partition, other query techniques can be (a lot) faster. Techniques along these lines:

Optimize GROUP BY query to retrieve latest row per user

If that applies, I suggest you start a new question disclosing exact table definition(s) and cardinalities ...

Closely related:

Get values from first and last row per group
PostgreSQL: joining arrays within group by clause
Use something like TOP with GROUP BY
Best performance in sampling repeated value from a grouped column

Answer 2:

Postgres doesn't have a first()/last() aggregation function. One method is:

SELECT DISTINCT influencer_id,
       ( FIRST_VALUE(followers) OVER (PARTITION BY influencer_id ORDER BY DATE DESC) -
         FIRST_VALUE(followers) OVER (PARTITION BY influencer_id ORDER BY DATE ASC)
       ) as growth
FROM influencer_follower_daily
WHERE date < '2020-05-30' AND date >= '2020-05-23';

Another alternative is to use arrays:

SELECT influencer_id,
       ( (ARRAY_AGG(followers ORDER BY DATE DESC))[1] -
         (ARRAY_AGG(followers ORDER BY DATE ASC))[1]
       ) as growth
FROM influencer_follower_daily
WHERE date < '2020-05-30' AND date >= '2020-05-23'
GROUP BY influencer_id;

Source: https://stackoverflow.com/questions/62156341/calculating-follower-growth-over-time-for-each-influencer

Tags

sql