问题
I have a table with influencers and their follower counter for each day:
influencer_id | date | followers
1 | 2020-05-29 | 7361
1 | 2020-05-28 | 7234
...
2 | 2020-05-29 | 82
2 | 2020-05-28 | 85
...
3 | 2020-05-29 | 3434
3 | 2020-05-28 | 2988
3 | 2020-05-27 | 2765
...
Let's say I want to calculate how many followers each individual influencer has gained in the last 7 days and get the following table:
influencer_id | growth
1 | <num followers last day - num followers first day>
2 | "
3 | "
As a first attempt I did this:
SELECT influencer_id,
(MAX(followers) - MIN(followers)) AS growth
FROM influencer_follower_daily
WHERE date < '2020-05-30'
AND date >= '2020-05-23'
GROUP BY influencer_id;
This works and shows the growth over the week for each influencer. But it assumes the follower count always increases and people never unfollow!
So is there a way to achieve what I want using an SQL query over the original table? Or will I have to generate a completely new table using a FOR
loop that calculates a +/- follower change column between each date?
回答1:
The simple aggregate functions first()
and last()
are not implemented in standard Postgres. But see below.
1. array_agg()
Gordon demonstrated a query with array_agg()
, but that's more expensive than necessary, especially with many rows per group. Even more so when called twice, and with ORDER BY
per aggregate. This equivalent alternative should be substantially faster:
SELECT influencer_id, arr[array_upper(arr, 1)] - arr[1]
FROM (
SELECT influencer_id, array_agg(followers) AS arr
FROM (
SELECT influencer_id, followers
FROM influencer_follower_daily
WHERE date >= '2020-05-23'
AND date < '2020-05-30'
ORDER BY influencer_id, date
) sub1
GROUP BY influencer_id
) sub2;
Because it sorts once and aggregates once. The sort order of the inner subquery sub1
is carried over to the next level. See:
- How to apply ORDER BY and LIMIT in combination with an aggregate function?
Indexes matter:
If you query the whole table or most of it, an index on
(influencer_id, date, followers)
can help (a lot) with index-only scans.If you query only a small fragment of the table, an index on
(date)
or(date, influencer_id, followers)
can help (a lot).
2. DISTINCT
& window functions
Gordon also demonstrated DISTINCT
with window functions. Again, can be substantially faster:
SELECT DISTINCT ON (influencer_id)
influencer_id
, last_value(followers) OVER (PARTITION BY influencer_id ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
- followers AS growth
FROM influencer_follower_daily
WHERE date >= '2020-05-23'
AND date < '2020-05-30'
ORDER BY influencer_id, date;
With a single window function, using the same sort order (!) as the main query. To achieve this, we need the non-default window definition with ROWS BETWEEN ...
See:
- PostgreSQL query with max and min date plus associated id per row
And DISTINCT ON
instead of DISTINCT
. See:
- Select first row in each GROUP BY group?
3. Custom aggregate functions
first()
and last()
You can add those yourself, it's pretty simple. See instructions in the Postgres Wiki.
Or install the additional module first_last_agg with a faster implementation in C.
Related:
- Use something like TOP with GROUP BY
Then your query becomes simpler:
SELECT influencer_id, last(followers) - first(followers) AS growth
FROM (
SELECT influencer_id, followers
FROM influencer_follower_daily
WHERE date >= '2020-03-02'
AND date < '2020-05-09'
ORDER BY influencer_id, date
) z
GROUP BY influencer_id
ORDER BY influencer_id;
Custom aggregate growth()
You can combine first()
and last()
in a single aggregate function. That's faster, but calling two C functions will still outperform one custom SQL function.
Basically encapsulates the logic of my first query in a custom aggregate:
CREATE OR REPLACE FUNCTION f_growth(anyarray)
RETURNS anyelement LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE AS
'SELECT $1[array_upper($1, 1)] - $1[1]';
CREATE OR REPLACE AGGREGATE growth(anyelement) (
SFUNC = array_append
, STYPE = anyarray
, FINALFUNC = f_growth
, PARALLEL = SAFE
);
Works for any numeric type (or any type with an operator type - type
returning the same type). The query is simpler, yet:
SELECT influencer_id, growth(followers)
FROM (
SELECT influencer_id, followers
FROM influencer_follower_daily
WHERE date >= '2020-05-23'
AND date < '2020-05-30'
ORDER BY influencer_id, date
) z
GROUP BY influencer_id
ORDER BY influencer_id;
Or a little slower, but ultimately short:
SELECT influencer_id, growth(followers ORDER BY date)
FROM influencer_follower_daily
WHERE date >= '2020-05-23'
AND date < '2020-05-30'
GROUP BY 1
ORDER BY 1;
db<>fiddle here
4. Performance optimization for many rows per group
With many rows per group / partition, other query techniques can be (a lot) faster. Techniques along these lines:
- Optimize GROUP BY query to retrieve latest row per user
If that applies, I suggest you start a new question disclosing exact table definition(s) and cardinalities ...
Closely related:
- Get values from first and last row per group
- PostgreSQL: joining arrays within group by clause
- Use something like TOP with GROUP BY
- Best performance in sampling repeated value from a grouped column
回答2:
Postgres doesn't have a first()
/last()
aggregation function. One method is:
SELECT DISTINCT influencer_id,
( FIRST_VALUE(followers) OVER (PARTITION BY influencer_id ORDER BY DATE DESC) -
FIRST_VALUE(followers) OVER (PARTITION BY influencer_id ORDER BY DATE ASC)
) as growth
FROM influencer_follower_daily
WHERE date < '2020-05-30' AND date >= '2020-05-23';
Another alternative is to use arrays:
SELECT influencer_id,
( ARRAY_AGG(followers ORDER BY DATE DESC) )[1] -
ARRAY_AGG(followers ORDER BY DATE ASC) )[1]
) as growth
FROM influencer_follower_daily
WHERE date < '2020-05-30' AND date >= '2020-05-23'
GROUP BY influencer_id;
来源:https://stackoverflow.com/questions/62156341/calculating-follower-growth-over-time-for-each-influencer