Sum duration of overlapping periods with priority by excluding the overlap itself


I have some R code that I am trying to rewrite in PostgreSQL to feed a Grafana dashboard. I have the basics, so I am almost done with the other parts of the script, but what I


2 Answers

予麋鹿

2021-01-22 19:48

This is a type of gaps-and-islands problem. To solve this, find where the "islands" begin and then aggregate. So, to get the islands:

select a.name, min(start) as startt, max("end") as endt
from (select a.*,
             count(*) filter (where prev_end is null or prev_end < start) over (partition by name order by start, id) as grp
      from (select a.*,
                   max("end") over (partition by name
                                    order by start, id
                                    rows between unbounded preceding and 1 preceding
                                   ) as prev_end
            from activities a
           ) a
     ) a
group by name, grp;

The next step is just to aggregate again:

with islands as (
      select a.name, min(start) as startt, max("end") as endt
      from (select a.*,
                   count(*) filter (where prev_end is null or prev_end < start) over (partition by name order by start, id) as grp
            from (select a.*,
                         max("end") over (partition by name
                                          order by start, id
                                          rows between unbounded preceding and 1 preceding
                                         ) as prev_end
                  from activities a
                 ) a
           ) a
      group by name, grp
     )
select name, sum(endt - startt) as duration
from islands i
group by name;

Here is a db<>fiddle.

Note that this uses a cumulative trailing maximum to define the overlaps. This is the most general method for determining overlaps. I think this will work on all edge cases, including:

1----------2---2----3--3-----1

It also handles ties on the start time.
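As a cross-check outside the database, the same trailing-maximum idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names and the `at` helper are hypothetical, the sample rows are the four activities shown in the tables further down this page, and the rows are sorted by (start, end) rather than (start, id), which does not change the resulting islands.

```python
from datetime import datetime, timedelta


def merge_islands(intervals):
    """Merge overlapping (start, end) pairs into islands.

    A row opens a new island when the largest end seen so far is
    missing or earlier than the row's start.
    """
    islands = []
    prev_end = None  # trailing maximum of "end" over all earlier rows
    for start, end in sorted(intervals):
        if prev_end is None or prev_end < start:
            islands.append([start, end])  # gap found: a new island begins
        else:
            islands[-1][1] = max(islands[-1][1], end)  # extend current island
        prev_end = end if prev_end is None else max(prev_end, end)
    return [tuple(island) for island in islands]


def total_duration(intervals):
    """Sum the length of every island, so overlapping time is counted once."""
    return sum((end - start for start, end in merge_islands(intervals)), timedelta(0))


def at(hour, minute=0):
    # Hypothetical helper: a timestamp on the sample day 2018-01-09.
    return datetime(2018, 1, 9, hour, minute)


# Sample rows as (start, end) pairs per name.
rows_a = [(at(17), at(20)), (at(18), at(20, 30))]
rows_b = [(at(19), at(21, 30)), (at(22), at(23))]

print(total_duration(rows_a))
print(total_duration(rows_b))

# The nested edge case from the diagram, using integer positions.
print(merge_islands([(0, 30), (11, 15), (20, 23)]))
```

In the nested case, ranges 2 and 3 do not touch each other, but both start before the running maximum end set by range 1, so all three collapse into a single island.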


梦如初夏

2021-01-22 19:59

Update: My original solution was not correct. The consolidation of ranges cannot be handled in a regular window. I confused myself by reusing the same name, trange, forgetting that the window operates over the source rows rather than the result rows. Please see the updated SQL Fiddle for the full query, along with an added record that illustrates the problem.

You can simplify the overlapping requirement as well as identifying gaps and islands using PostgreSQL range types.

The following query is intentionally verbose to show each step of the process. A number of steps can be combined.


First, add an inclusive [start, end] range to each record.

with add_ranges as (
  select id, name, tsrange(start, "end", '[]') as t_range
    from activities
), 

 id | name |                    t_range                    
----+------+-----------------------------------------------
  1 | A    | ["2018-01-09 17:00:00","2018-01-09 20:00:00"]
  2 | A    | ["2018-01-09 18:00:00","2018-01-09 20:30:00"]
  3 | B    | ["2018-01-09 19:00:00","2018-01-09 21:30:00"]
  4 | B    | ["2018-01-09 22:00:00","2018-01-09 23:00:00"]
(4 rows)

Compare each range with the previous row's range (via lag) using the && overlap operator, and mark the beginning of each new island with a 1.

mark_islands as (
  select id, name, t_range,
         case
           when t_range && lag(t_range) over w then 0
           else 1
         end as new_range
    from add_ranges
  window w as (partition by name order by t_range)
),

 id | name |                    t_range                    | new_range 
----+------+-----------------------------------------------+-----------
  1 | A    | ["2018-01-09 17:00:00","2018-01-09 20:00:00"] |         1
  2 | A    | ["2018-01-09 18:00:00","2018-01-09 20:30:00"] |         0
  3 | B    | ["2018-01-09 19:00:00","2018-01-09 21:30:00"] |         1
  4 | B    | ["2018-01-09 22:00:00","2018-01-09 23:00:00"] |         1
(4 rows)
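One case worth checking against a previous-row comparison like the one above is a short range nested inside a longer one. The Python sketch below is an illustration only, not part of the original query: the function names and the integer test ranges are assumptions, and ranges are treated as inclusive on both ends to mirror '[]'. It contrasts marking islands by comparing with the previous row only against marking them with the running maximum of the upper bound.

```python
def lag_flags(ranges):
    """Flag 1 where a range does not overlap the immediately preceding range."""
    ordered = sorted(ranges)  # lower bound first, then upper bound
    flags = []
    for i, (start, end) in enumerate(ordered):
        if i == 0:
            flags.append(1)  # no previous row: always a new island
            continue
        prev_start, prev_end = ordered[i - 1]
        overlaps = start <= prev_end and prev_start <= end  # inclusive bounds
        flags.append(0 if overlaps else 1)
    return flags


def trailing_max_flags(ranges):
    """Flag 1 where a range starts after every earlier range has ended."""
    flags = []
    max_end = None  # largest upper bound among all earlier rows
    for start, end in sorted(ranges):
        flags.append(1 if max_end is None or max_end < start else 0)
        max_end = end if max_end is None else max(max_end, end)
    return flags


# A long range containing two short, disjoint ranges.
nested = [(0, 30), (11, 15), (20, 23)]

print(lag_flags(nested))
print(trailing_max_flags(nested))
```

With this input the previous-row comparison flags the third range as a new island, because (20, 23) does not overlap (11, 15), while the running maximum keeps all three ranges in one island. On the four sample rows the two approaches agree.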

Number the groups with a running sum of new_range within each name.

group_nums as (
  select id, name, t_range, 
         sum(new_range) over (partition by name order by t_range) as group_num
    from mark_islands
),

 id | name |                    t_range                    | group_num 
----+------+-----------------------------------------------+-----------
  1 | A    | ["2018-01-09 17:00:00","2018-01-09 20:00:00"] |         1
  2 | A    | ["2018-01-09 18:00:00","2018-01-09 20:30:00"] |         1
  3 | B    | ["2018-01-09 19:00:00","2018-01-09 21:30:00"] |         1
  4 | B    | ["2018-01-09 22:00:00","2018-01-09 23:00:00"] |         2

Group by name, group_num to get the total time spent on the island as well as a complete t_range to be used in overlap deduction.

islands as (
  select name,
         tsrange(min(lower(t_range)), max(upper(t_range)), '[]') as t_range,
         max(upper(t_range)) - min(lower(t_range)) as island_time_interval
    from group_nums
   group by name, group_num
),

 name |                    t_range                    | island_time_interval 
------+-----------------------------------------------+----------------------
 A    | ["2018-01-09 17:00:00","2018-01-09 20:30:00"] | 03:30:00
 B    | ["2018-01-09 19:00:00","2018-01-09 21:30:00"] | 02:30:00
 B    | ["2018-01-09 22:00:00","2018-01-09 23:00:00"] | 01:00:00
(3 rows)

For the requirement to count overlap time between A messages and B messages, find each case where an A island overlaps a B island, and use the * (intersection) operator to compute the overlapping range.

priority_overlaps as (
  select b.name, a.t_range * b.t_range as overlap_range
    from islands a
    join islands b
      on a.t_range && b.t_range
     and a.name = 'A' and b.name != 'A'
),

 name |                 overlap_range                 
------+-----------------------------------------------
 B    | ["2018-01-09 19:00:00","2018-01-09 20:30:00"]
(1 row)

Sum the total time of each overlap by name.

overlap_time as (
  select name, sum(upper(overlap_range) - lower(overlap_range)) as total_overlap_interval
    from priority_overlaps
   group by name
),

 name | total_overlap_interval 
------+------------------------
 B    | 01:30:00
(1 row)

Calculate the total time for each name.

island_times as (
  select name, sum(island_time_interval) as name_time_interval
    from islands
   group by name
)

 name | name_time_interval 
------+--------------------
 B    | 03:30:00
 A    | 03:30:00
(2 rows)

Join the total time for each name to adjustments from the overlap_time CTE, and subtract the adjustment for the final duration value.

select i.name,
       i.name_time_interval - coalesce(o.total_overlap_interval, interval '0') as duration
  from island_times i
  left join overlap_time o
    on o.name = i.name
;

 name | duration 
------+----------
 B    | 02:00:00
 A    | 03:30:00
(2 rows)
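The final subtraction can be sanity-checked with a short Python sketch. It is a minimal illustration under assumptions: the function names and the `at` helper are hypothetical, the input is the already-merged islands from the islands step, and 'A' is taken as the priority name as in the priority_overlaps CTE.

```python
from datetime import datetime, timedelta


def overlap(a, b):
    """Length of the intersection of two (start, end) ranges, never negative."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(end - start, timedelta(0))


def durations(islands, priority="A"):
    """Total island time per name, minus time shared with the priority name."""
    result = {}
    for name, ranges in islands.items():
        total = sum((end - start for start, end in ranges), timedelta(0))
        if name != priority:
            # Islands of one name are disjoint, so the overlaps can simply be summed.
            total -= sum(
                (overlap(p, r) for p in islands.get(priority, []) for r in ranges),
                timedelta(0),
            )
        result[name] = total
    return result


def at(hour, minute=0):
    # Hypothetical helper: a timestamp on the sample day 2018-01-09.
    return datetime(2018, 1, 9, hour, minute)


# Merged islands per name, copied from the islands step.
islands = {
    "A": [(at(17), at(20, 30))],
    "B": [(at(19), at(21, 30)), (at(22), at(23))],
}

for name, duration in durations(islands).items():
    print(name, duration)
```

For the sample data this gives 3:30:00 for A and 2:00:00 for B, matching the query result above.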
