Question
I am stuck on a SQL problem. Suppose we have a dataset like this in Redshift:
account_id day event_id
111 2019-01-01 1000
111 2019-01-02 1001
111 2019-01-02 1002
111 2019-01-10 1003
111 2019-01-25 1004
111 2019-02-05 1005
111 2019-02-24 1006
111 2019-02-28 1007
111 2019-03-02 1008
111 2019-03-15 1009
222 2019-01-01 1000
222 2019-01-02 1001
222 2019-01-02 1002
222 2019-01-10 1003
222 2019-01-25 1004
222 2019-02-05 1005
222 2019-02-24 1006
222 2019-02-28 1007
222 2019-03-02 1008
222 2019-03-15 1009
I need to pick event_ids using a moving 30-day window per account_id: pick the first event, ignore everything in the 30 days after it, then pick the next event and restart the window from that event's date.
So in this case, for both account_ids 111 and 222:
- we pick the first event_id = 1000, and then ignore everything until the 1st of February (30 days)
- then we pick event_id = 1005, and ignore everything until the 5th of March (since event_id = 1005 happened on the 5th of February)
- then we pick event_id = 1009 on the 15th of March, and ignore everything until the 15th of April...
You get the picture.
How can this be done?
Answer 1:
I also couldn't find a solution based purely on window functions.
But in PostgreSQL, a recursive CTE works for this.
The temp table provides a sequential id that each recursion step uses to join to the next record.
CREATE TEMPORARY TABLE tempEventDates (
    id SERIAL PRIMARY KEY,
    account_id int NOT NULL,
    day date NOT NULL,
    min_day date NOT NULL,
    event_id int NOT NULL
);

-- Load the rows in (account_id, day, event_id) order so that consecutive ids
-- follow each other within an account.
INSERT INTO tempEventDates (account_id, day, min_day, event_id)
SELECT account_id, day,
       MIN(day) OVER (PARTITION BY account_id) AS min_day, event_id
FROM yourtable
GROUP BY account_id, day, event_id
ORDER BY account_id, day, event_id;

WITH RECURSIVE RCTE AS
(
    -- Anchor: the earliest row of each account opens the first window.
    SELECT id, account_id, event_id, day, min_day
    FROM tempEventDates
    WHERE day = min_day

    UNION ALL

    -- Step: move to the next row; open a new window when the row falls
    -- more than 30 days after the current window start.
    SELECT t.id, t.account_id, t.event_id, t.day,
           CASE WHEN t.day > c.min_day + interval '30 days' THEN t.day ELSE c.min_day END
    FROM RCTE c
    JOIN tempEventDates t
      ON t.account_id = c.account_id
     AND t.id = c.id + 1
)
SELECT account_id, day, event_id
FROM RCTE
WHERE day = min_day  -- keep only the rows that opened a window
ORDER BY account_id, day;
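To sanity-check the query's output outside the database, the same carry-forward logic can be sketched in Python. This is only a minimal sketch under assumptions: `pick_events`, `WINDOW`, `DAYS`, and `SAMPLE` are hypothetical names, and rows are assumed to be (account_id, day, event_id) tuples. Like the query, it keeps every row whose day equals the current window start.

```python
from datetime import date, timedelta
from itertools import groupby

WINDOW = timedelta(days=30)  # hypothetical constant: the 30-day interval

def pick_events(rows):
    """Return the rows that open a new 30-day window, per account."""
    picked = []
    # De-duplicate and sort by (account_id, day, event_id), like the INSERT.
    ordered = sorted(set(rows))
    for account_id, group in groupby(ordered, key=lambda r: r[0]):
        window_start = None  # plays the role of min_day in the CTE
        for _, day, event_id in group:
            # Open a new window on the first row, or once a row falls
            # more than 30 days after the current window start.
            if window_start is None or day > window_start + WINDOW:
                window_start = day
            if day == window_start:
                picked.append((account_id, day, event_id))
    return picked

# Sample data from the question, repeated for both accounts.
DAYS = [
    ("2019-01-01", 1000), ("2019-01-02", 1001), ("2019-01-02", 1002),
    ("2019-01-10", 1003), ("2019-01-25", 1004), ("2019-02-05", 1005),
    ("2019-02-24", 1006), ("2019-02-28", 1007), ("2019-03-02", 1008),
    ("2019-03-15", 1009),
]
SAMPLE = [(acct, date.fromisoformat(d), ev)
          for acct in (111, 222) for d, ev in DAYS]

# pick_events(SAMPLE) keeps event_ids 1000, 1005, 1009 for each account.
```

Because the window start is carried from row to row, each row needs only the state left by the previous row, which is what the id + 1 join expresses in the recursive CTE.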
Answer 2:
I can hardly see a solution based purely on window functions: each picked row depends on the previously picked row, and IMHO that kind of dependency is beyond what window functions can express.
Here is a PostgreSQL solution based on a recursive query:
with recursive t (day, event_id) as (
    select date '2019-01-01', 1000 union
    select date '2019-01-02', 1001 union
    select date '2019-01-02', 1002 union
    select date '2019-01-10', 1003 union
    select date '2019-01-25', 1004 union
    select date '2019-02-05', 1005 union
    select date '2019-02-24', 1006 union
    select date '2019-02-28', 1007 union
    select date '2019-03-02', 1008 union
    select date '2019-03-15', 1009
), rec (day, event_id) as (
    -- anchor: the earliest event
    select t.* from t where day = (select min(day) from t)
    union all
    -- step: the first event more than 30 days after the last picked one
    select tl.*
    from rec,
         lateral (
             select *
             from t
             where t.day > rec.day + interval '30 days'
             order by t.day
             limit 1
         ) tl
)
select * from rec order by day;
UPDATE after the specification changed (account_id added):
with recursive t (account_id, day, event_id) as (
    select 111, date '2019-01-01', 1000 union
    select 111, date '2019-01-02', 1001 union
    select 111, date '2019-01-02', 1002 union
    select 111, date '2019-01-10', 1003 union
    select 111, date '2019-01-25', 1004 union
    select 111, date '2019-02-05', 1005 union
    select 111, date '2019-02-24', 1006 union
    select 111, date '2019-02-28', 1007 union
    select 111, date '2019-03-02', 1008 union
    select 111, date '2019-03-15', 1009 union
    select 222, date '2019-01-01', 1000 union
    select 222, date '2019-01-02', 1001 union
    select 222, date '2019-01-02', 1002 union
    select 222, date '2019-01-10', 1003 union
    select 222, date '2019-01-25', 1004 union
    select 222, date '2019-02-05', 1005 union
    select 222, date '2019-02-24', 1006 union
    select 222, date '2019-02-28', 1007 union
    select 222, date '2019-03-02', 1008 union
    select 222, date '2019-03-15', 1009
), seed as (
    -- number the events of each account by day
    select t.*, row_number() over (partition by t.account_id order by day) as rn
    from t
), rec (account_id, day, event_id) as (
    -- anchor: the earliest event of each account
    select account_id, day, event_id
    from seed
    where rn = 1
    union all
    -- step: per account, the first event more than 30 days after the last picked one
    select tl.*
    from rec,
         lateral (
             select *
             from t
             where t.account_id = rec.account_id
               and t.day > rec.day + interval '30 days'
             order by t.day
             limit 1
         ) tl
)
select *
from rec
order by account_id, day;
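The lateral subquery jumps from each picked event straight to the first event more than 30 days later. As a rough cross-check, that jump can be sketched in Python with a binary search. This is a sketch under assumptions, not a translation of the query: `pick_by_jumping` is a hypothetical name, rows are assumed to be (account_id, day, event_id) tuples, and ties on the same day are broken by the lowest event_id, which the SQL does not guarantee.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date, timedelta

def pick_by_jumping(rows, window=timedelta(days=30)):
    """Pick the first event per account, then repeatedly jump to the
    first event that is more than `window` after the last picked one."""
    by_account = defaultdict(list)
    for account_id, day, event_id in rows:
        by_account[account_id].append((day, event_id))

    picked = []
    for account_id in sorted(by_account):
        events = sorted(by_account[account_id])  # order by day, then event_id
        days = [d for d, _ in events]
        i = 0
        while i < len(events):
            day, event_id = events[i]
            picked.append((account_id, day, event_id))
            # Index of the first event with day > day + window; this is
            # always greater than i, so the loop terminates.
            i = bisect_right(days, day + window)
    return picked
```

Called with the question's rows, this picks events 1000, 1005, and 1009 for each account, the same rows the recursive query returns for the sample data.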
Source: https://stackoverflow.com/questions/59006588/how-to-ignore-rows-with-moving-30-day-interval