How to ignore rows with moving 30 day interval?

Posted by 梦想的初衷 on 2021-02-08 07:52:38

Question


I got stuck on a SQL problem. Let's say we have a dataset like this in Redshift:

account_id  day          event_id
111         2019-01-01   1000
111         2019-01-02   1001
111         2019-01-02   1002
111         2019-01-10   1003
111         2019-01-25   1004
111         2019-02-05   1005
111         2019-02-24   1006
111         2019-02-28   1007
111         2019-03-02   1008
111         2019-03-15   1009
222         2019-01-01   1000
222         2019-01-02   1001
222         2019-01-02   1002
222         2019-01-10   1003
222         2019-01-25   1004
222         2019-02-05   1005
222         2019-02-24   1006
222         2019-02-28   1007
222         2019-03-02   1008
222         2019-03-15   1009

For each account_id, I need to pick the first event, ignore every event in the 30 days that follow, and then start a new 30-day window at the date of the next event I find.

So in this case, for both account_ids 111 and 222:

  • We pick the first event_id = 1000, and then ignore everything until the 1st of February (30 days).
  • Then we pick event_id = 1005, and ignore everything until the 5th of March (since event_id = 1005 happened on the 5th of February).
  • Then we pick event_id = 1009 on the 15th of March, and ignore everything until the 15th of April, and so on.

You get the picture.
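Outside SQL, the rule described above amounts to a single ordered pass per account. The following is only a sketch in Python to pin down the intended behaviour; the function name `pick_events` and the `gap_days` parameter are hypothetical, and the strict "more than 30 days" comparison is an assumption that mirrors the `> ... + interval '30 days'` test used in the answers.

```python
from datetime import date, timedelta

# Sample data from the question: the same ten events for accounts 111 and 222.
EVENTS = [
    (date(2019, 1, 1), 1000), (date(2019, 1, 2), 1001),
    (date(2019, 1, 2), 1002), (date(2019, 1, 10), 1003),
    (date(2019, 1, 25), 1004), (date(2019, 2, 5), 1005),
    (date(2019, 2, 24), 1006), (date(2019, 2, 28), 1007),
    (date(2019, 3, 2), 1008), (date(2019, 3, 15), 1009),
]
ROWS = [(acc, day, ev) for acc in (111, 222) for day, ev in EVENTS]


def pick_events(rows, gap_days=30):
    """Keep the first event of each account, then every event that falls
    more than gap_days after the most recently kept event."""
    picked = []
    last_kept = {}  # account_id -> day of the most recently kept event
    # Sorting the tuples orders them by account_id, day, event_id.
    for account_id, day, event_id in sorted(rows):
        start = last_kept.get(account_id)
        if start is None or day > start + timedelta(days=gap_days):
            picked.append((account_id, day, event_id))
            last_kept[account_id] = day
    return picked


for row in pick_events(ROWS):
    print(row)
```

On the sample data this keeps event_ids 1000, 1005 and 1009 for each account, which matches the picks listed in the bullets.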

How to do this?


Answer 1:


I also couldn't find a solution purely based on window functions.

But in PostgreSQL a recursive CTE works for this.

The temp table provides a sequential id, so each record can be joined to the next one (id + 1).

CREATE TEMPORARY TABLE tempEventDates (
 id SERIAL primary key, 
 account_id int not null,
 day date not null,
 min_day date not null,
 event_id int not null
);

INSERT INTO tempEventDates (account_id, day, min_day, event_id)
SELECT account_id, day,
MIN(day) OVER (PARTITION BY account_id) as min_day, event_id
FROM yourtable
GROUP BY account_id, day, event_id
ORDER BY account_id, day, event_id;

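-- Walk the rows in id order, carrying the start date of the current 30-day window in min_day.
WITH RECURSIVE RCTE AS
(
    -- Anchor: rows on each account's earliest day open the first window.
    SELECT id, account_id, event_id, day, min_day
    FROM tempEventDates
    WHERE day = min_day

    UNION ALL

    -- Step: a row more than 30 days after the window start opens a new window;
    -- any other row inherits the current window start.
    SELECT t.id, t.account_id, t.event_id, t.day,
     CASE WHEN t.day > c.min_day + interval '30 days' THEN t.day ELSE c.min_day END
    FROM RCTE c
    JOIN tempEventDates t
      ON t.account_id = c.account_id
     AND t.id = c.id + 1
)
-- Rows whose day equals their window start are the picked events.
SELECT account_id, day, event_id
FROM RCTE
WHERE day = min_day
ORDER BY account_id, day;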

A test on rextester here
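To see what the min_day column of the recursive CTE carries from row to row, here is a rough Python trace for a single account. It is an illustration under assumptions, not the answer's code: `trace_window_start` is a hypothetical name, and the rows are taken to be already ordered the way the id column orders them.

```python
from datetime import date, timedelta

# One account's events from the question, in (day, event_id) order.
EVENTS = [
    (date(2019, 1, 1), 1000), (date(2019, 1, 2), 1001),
    (date(2019, 1, 2), 1002), (date(2019, 1, 10), 1003),
    (date(2019, 1, 25), 1004), (date(2019, 2, 5), 1005),
    (date(2019, 2, 24), 1006), (date(2019, 2, 28), 1007),
    (date(2019, 3, 2), 1008), (date(2019, 3, 15), 1009),
]


def trace_window_start(events, gap_days=30):
    """Return (event_id, day, window_start) for every row, where
    window_start plays the role of the carried min_day column."""
    traced = []
    window_start = None
    for day, event_id in events:
        # Same test as the CASE expression: open a new window only when
        # the row is more than gap_days after the current window start.
        if window_start is None or day > window_start + timedelta(days=gap_days):
            window_start = day
        traced.append((event_id, day, window_start))
    return traced


for event_id, day, window_start in trace_window_start(EVENTS):
    flag = "kept" if day == window_start else ""
    print(event_id, day, window_start, flag)
```

Every row gets a window start, but only the rows whose day equals it (1000, 1005 and 1009 here) survive the final `WHERE day = min_day` filter.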




Answer 2:


I can hardly see a solution based purely on window functions: whether a row is picked depends on which earlier rows were picked, and IMHO that is more than window functions can express.
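A small Python sketch of why a fixed look-back falls short: comparing each row only with the row immediately before it, which is roughly what a LAG-style window expression would do, never sees a gap of more than 30 days in the sample data. The function name `naive_lag_filter` is hypothetical and the code is illustrative only.

```python
from datetime import date, timedelta

# One account's events from the question, in (day, event_id) order.
EVENTS = [
    (date(2019, 1, 1), 1000), (date(2019, 1, 2), 1001),
    (date(2019, 1, 2), 1002), (date(2019, 1, 10), 1003),
    (date(2019, 1, 25), 1004), (date(2019, 2, 5), 1005),
    (date(2019, 2, 24), 1006), (date(2019, 2, 28), 1007),
    (date(2019, 3, 2), 1008), (date(2019, 3, 15), 1009),
]


def naive_lag_filter(events, gap_days=30):
    """Keep the first row, plus any row more than gap_days after the row
    directly before it (no memory of which rows were kept earlier)."""
    kept = [events[0][1]]
    for (prev_day, _), (day, event_id) in zip(events, events[1:]):
        if day > prev_day + timedelta(days=gap_days):
            kept.append(event_id)
    return kept


print(naive_lag_filter(EVENTS))
```

This keeps only event 1000, whereas the question expects 1000, 1005 and 1009. The comparison has to be made against the last kept row, and which row that is only becomes known while walking the data.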

Here is a PostgreSQL solution based on a recursive query:

with recursive t (day,event_id) as (
  select date '2019-01-01', 1000 union
  select date '2019-01-02', 1001 union
  select date '2019-01-02', 1002 union
  select date '2019-01-10', 1003 union
  select date '2019-01-25', 1004 union
  select date '2019-02-05', 1005 union
  select date '2019-02-24', 1006 union
  select date '2019-02-28', 1007 union
  select date '2019-03-02', 1008 union
  select date '2019-03-15', 1009
), rec (day, event_id) as (
  select t.* from t where day = (select min(day) from t)
  union all
  select tl.*
  from rec,
  lateral (
    select *
    from t
    where t.day > rec.day + interval '30 days'
    order by t.day
    limit 1
  ) tl
)
select * from rec order by day;

UPDATE after specification change (account_id addition):

with recursive t (account_id,day,event_id) as (
  select 111, date '2019-01-01', 1000 union
  select 111, date '2019-01-02', 1001 union
  select 111, date '2019-01-02', 1002 union
  select 111, date '2019-01-10', 1003 union
  select 111, date '2019-01-25', 1004 union
  select 111, date '2019-02-05', 1005 union
  select 111, date '2019-02-24', 1006 union
  select 111, date '2019-02-28', 1007 union
  select 111, date '2019-03-02', 1008 union
  select 111, date '2019-03-15', 1009 union
  select 222, date '2019-01-01', 1000 union
  select 222, date '2019-01-02', 1001 union
  select 222, date '2019-01-02', 1002 union
  select 222, date '2019-01-10', 1003 union
  select 222, date '2019-01-25', 1004 union
  select 222, date '2019-02-05', 1005 union
  select 222, date '2019-02-24', 1006 union
  select 222, date '2019-02-28', 1007 union
  select 222, date '2019-03-02', 1008 union
  select 222, date '2019-03-15', 1009
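), seed as (
  -- Number each account's events by date; event_id breaks ties on the same day.
  select t.*, row_number() over (partition by t.account_id order by t.day, t.event_id) as rn
  from t
), rec (account_id, day, event_id) as (
  -- Anchor: the first event of each account.
  select account_id, day, event_id
  from seed
  where rn = 1
  union all
  -- Step: the earliest event of the same account more than 30 days after the last pick.
  select tl.*
  from rec,
  lateral (
    select *
    from t
    where t.account_id = rec.account_id
      and t.day > rec.day + interval '30 days'
    order by t.day, t.event_id
    limit 1
  ) tl
)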
select *
from rec
order by account_id, day;


Source: https://stackoverflow.com/questions/59006588/how-to-ignore-rows-with-moving-30-day-interval
