Aggregate Overlapping Segments to Measure Effective Length

萝らか妹 posted on 2019-12-03 02:30:01

My main DBMS is Teradata, but this will work as-is in Oracle, too.

WITH all_meas AS
 ( -- get a distinct list of all from/to points
   SELECT road_id, from_meas AS meas
   FROM road_events
   UNION
   SELECT road_id, to_meas
   FROM road_events
 )
-- select * from all_meas order by 1,2
 , all_ranges AS
 ( -- create from/to ranges
   SELECT road_id, meas AS from_meas 
     ,Lead(meas)
      Over (PARTITION BY road_id
            ORDER BY meas) AS to_meas
   FROM all_meas
  )
 -- SELECT * from all_ranges order by 1,2
, all_event_ranges AS
 ( -- now match the ranges to the event ranges
   SELECT 
      ar.*
     ,re.event_id
     ,re.year
     ,re.total_road_length
     ,ar.to_meas - ar.from_meas AS event_length
     -- used to filter the latest event as multiple events might cover the same range 
     ,Row_Number()
      Over (PARTITION BY ar.road_id, ar.from_meas
            ORDER BY year DESC) AS rn
   FROM all_ranges ar
   JOIN road_events re
     ON ar.road_id = re.road_id
    AND ar.from_meas < re.to_meas
    AND ar.to_meas > re.from_meas
   WHERE ar.to_meas IS NOT NULL
 )
SELECT event_id, road_id, year, total_road_length, Sum(event_length)
FROM all_event_ranges
WHERE rn = 1 -- latest year only
GROUP BY event_id, road_id, year, total_road_length
ORDER BY road_id, year DESC;
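
To make the steps of the query above easier to follow outside SQL, here is a minimal Python sketch of the same idea: cut each road at every from/to point, give each elementary range to the newest covering event, and sum the lengths per event. The tuple layout, the name `effective_lengths`, and the tie-break on `event_id` for events of the same year are assumptions of this sketch (the query orders by year only, so same-year ties are arbitrary there). The sample rows are the road 6 test rows quoted elsewhere in this thread.

```python
from collections import defaultdict

# (event_id, road_id, year, from_meas, to_meas): road 6 rows from this thread
ROAD_EVENTS = [
    (16, 6, 2012, 0, 100),
    (17, 6, 2013, 68, 69),
    (18, 6, 2014, 65, 66),
    (19, 6, 2015, 62, 63),
    (20, 6, 2016, 50, 60),
    (21, 6, 2017, 30, 40),
    (22, 6, 2017, 20, 55),
    (23, 6, 2018, 0, 25),
]


def effective_lengths(events):
    """Return {event_id: length not covered by a newer event}."""
    by_road = defaultdict(list)
    for ev in events:
        by_road[ev[1]].append(ev)

    totals = defaultdict(int)
    for road_events in by_road.values():
        # all_meas: distinct list of every from/to point on the road
        points = sorted({p for ev in road_events for p in ev[3:5]})
        # all_ranges: consecutive points form the elementary ranges
        for lo, hi in zip(points, points[1:]):
            # all_event_ranges: events that overlap this elementary range
            covering = [ev for ev in road_events if ev[3] < hi and ev[4] > lo]
            if not covering:
                continue  # gap: no event touches this range
            # rn = 1: newest year wins; event_id breaks ties (assumption)
            winner = max(covering, key=lambda ev: (ev[2], ev[0]))
            totals[winner[0]] += hi - lo
    return dict(totals)


print(effective_lengths(ROAD_EVENTS))
```

Event 21 is missing from the result because event 22, from the same year, covers its whole range under the tie-break used here.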

If you need to return the actual covered from/to_meas (as in your question before edit), it might be more complicated. The first part is the same, but without aggregation the query can return adjacent rows with the same event_id (e.g. for event 3: 0-1 & 1-25):

SELECT * FROM all_event_ranges
WHERE rn = 1
ORDER BY road_id, from_meas;

If you want to merge adjacent rows you need two more steps (using a standard approach, flag the 1st row of a group and calculate a group number):

WITH all_meas AS
 (
   SELECT road_id, from_meas AS meas
   FROM road_events
   UNION
   SELECT road_id, to_meas
   FROM road_events
 )
-- select * from all_meas order by 1,2
 , all_ranges AS
 ( 
   SELECT road_id, meas AS from_meas 
     ,Lead(meas)
      Over (PARTITION BY road_id
            ORDER BY meas) AS to_meas
   FROM all_meas
  )
-- SELECT * from all_ranges order by 1,2
, all_event_ranges AS
 (
   SELECT 
      ar.*
     ,re.event_id
     ,re.year
     ,re.total_road_length
     ,ar.to_meas - ar.from_meas AS event_length
     ,Row_Number()
      Over (PARTITION BY ar.road_id, ar.from_meas
            ORDER BY year DESC) AS rn
   FROM all_ranges ar
   JOIN road_events  re
     ON ar.road_id = re.road_id
    AND ar.from_meas < re.to_meas
    AND ar.to_meas > re.from_meas
   WHERE ar.to_meas IS NOT NULL
 )
-- SELECT * FROM all_event_ranges WHERE rn = 1 ORDER BY road_id, from_meas
, adjacent_events AS 
 ( -- assign 1 to the 1st row of an event
   SELECT t.*
     ,CASE WHEN Lag(event_id)
                Over(PARTITION BY road_id
                     ORDER BY from_meas) = event_id
           THEN 0 
           ELSE 1 
      END AS flag
   FROM all_event_ranges t
   WHERE rn = 1
 )
-- SELECT * FROM adjacent_events ORDER BY road_id, from_meas 
, grouped_events AS
 ( -- assign a groupnumber to adjacent rows using a Cumulative Sum over 0/1
   SELECT t.*
     ,Sum(flag)
      Over (PARTITION BY road_id
            ORDER BY from_meas
            ROWS Unbounded Preceding) AS grp
   FROM adjacent_events t
)
-- SELECT * FROM grouped_events ORDER BY  road_id, from_meas
SELECT event_id, road_id, year, Min(from_meas), Max(to_meas), total_road_length, Sum(event_length)
FROM grouped_events
GROUP BY event_id, road_id, grp, year, total_road_length
ORDER BY 2, Min(from_meas);
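
The two extra steps (flag the first row of each run, then turn the flags into a group number with a running total) can be sketched in Python as follows. This is only an illustration of the technique, not a translation of the full query: the row layout and the name `merge_adjacent` are assumptions, and the input rows are illustrative values for a single road, already ordered by from_meas.

```python
from itertools import accumulate

# (from_meas, to_meas, event_id): illustrative rn = 1 rows of one road
ROWS = [
    (0, 20, 23),
    (20, 25, 23),
    (25, 30, 22),
    (30, 40, 22),
    (40, 50, 22),
    (50, 55, 22),
    (55, 60, 20),
    (60, 62, 16),
    (62, 63, 19),
    (63, 65, 16),
]


def merge_adjacent(rows):
    """Merge consecutive rows that belong to the same event."""
    # Step 1: flag = 1 when the previous row has a different event_id
    flags = [
        1 if i == 0 or rows[i - 1][2] != row[2] else 0
        for i, row in enumerate(rows)
    ]
    # Step 2: the cumulative sum over the 0/1 flags is the group number
    groups = list(accumulate(flags))
    # Step 3: aggregate per group, like Min(from_meas) / Max(to_meas)
    merged = {}
    for (lo, hi, event_id), grp in zip(rows, groups):
        if grp not in merged:
            merged[grp] = [event_id, lo, hi]
        else:
            merged[grp][1] = min(merged[grp][1], lo)
            merged[grp][2] = max(merged[grp][2], hi)
    return [tuple(v) for v in merged.values()]


for event_id, lo, hi in merge_adjacent(ROWS):
    print(event_id, lo, hi)
```

Note that event 16 ends up in two separate groups (60-62 and 63-65), which is why the final GROUP BY needs grp in addition to event_id.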

Edit:

Oops, I just found a blog post, "Overlapping ranges with priority", doing exactly the same with some simplified Oracle syntax. In fact, I translated my query from another simplified syntax in Teradata to Standard/Oracle SQL :-)

There is another way to calculate this, with from and to values:

with 
  part_begin_point as (
    Select distinct road_id, from_meas point
    from road_events be
    union 
    Select distinct road_id, to_meas point
    from road_events ee
  )
, newest_part as (
  select e.event_id
  , e.road_id
  , e.year
  , e.total_road_length
  , p.point
  , LAG(e.event_id) over (partition by p.road_id order by p.point) prev_event
  , e.to_meas event_to_meas
  from part_begin_point p
  join road_events e
   on p.road_id = e.road_id
   and p.point >= e.from_meas and  p.point < e.to_meas
   and not exists(
        select 1 from road_events ne 
        where e.road_id = ne.road_id
        and p.point >= ne.from_meas and p.point < ne.to_meas
        and (e.year < ne.year or e.year = ne.year and e.event_id < ne.event_id))
  )
select event_id, road_id, year
, point from_meas
, LEAD(point, 1, event_to_meas) over (partition by road_id order by point) to_meas
, total_road_length
, LEAD(point, 1, event_to_meas) over (partition by road_id order by point) - point EVENT_LENGTH
from newest_part
where (event_id <> prev_event or prev_event is null)
order by event_id, point

SQL Fiddle
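
For readers who want to trace the point-based query above by hand, here is a rough Python sketch of the same steps: find the newest event at every from/to point, keep only the points where that event changes, and let each kept point run to the next one. The names `newest_at` and `segments` and the tuple layout are assumptions. The clamp to the event's own to_meas is an extra guard of this sketch for roads with uncovered gaps; the query itself does not have it.

```python
# (event_id, year, from_meas, to_meas) for one road: road 6 rows from this thread
EVENTS = [
    (16, 2012, 0, 100),
    (17, 2013, 68, 69),
    (18, 2014, 65, 66),
    (19, 2015, 62, 63),
    (20, 2016, 50, 60),
    (21, 2017, 30, 40),
    (22, 2017, 20, 55),
    (23, 2018, 0, 25),
]


def newest_at(point, events):
    """Newest event with from_meas <= point < to_meas, or None."""
    covering = [ev for ev in events if ev[2] <= point < ev[3]]
    # later year wins, then higher event_id (same rule as the NOT EXISTS)
    return max(covering, key=lambda ev: (ev[1], ev[0])) if covering else None


def segments(events):
    """Return (event_id, from_meas, to_meas) for each visible stretch."""
    points = sorted({p for ev in events for p in ev[2:4]})
    starts = []      # (point, event) where the newest event changes
    prev_id = None   # plays the role of LAG(event_id)
    for p in points:
        ev = newest_at(p, events)
        if ev is None:
            continue  # the inner join drops points that no event covers
        if ev[0] != prev_id:
            starts.append((p, ev))
        prev_id = ev[0]
    out = []
    for i, (p, ev) in enumerate(starts):
        # LEAD(point, 1, event_to_meas): next kept point, else the event's end
        nxt = starts[i + 1][0] if i + 1 < len(starts) else ev[3]
        out.append((ev[0], p, min(nxt, ev[3])))  # min() is the extra guard
    return out


for row in segments(EVENTS):
    print(row)
```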

Thought about this too much today, but I have something that ignores the +/- 10 meters now.

First, I made a function that takes from/to pairs as a string and returns the distance covered by the pairs in the string. For example, '10:20;35:45' returns 20.

CREATE
    OR replace FUNCTION get_distance_range_str (strRangeStr VARCHAR2)

RETURN NUMBER IS intRetNum NUMBER;

BEGIN
    --split input string
    WITH cte_1
    AS (
        SELECT regexp_substr(strRangeStr, '[^;]+', 1, LEVEL) AS TO_FROM_STRING
        FROM dual connect BY regexp_substr(strRangeStr, '[^;]+', 1, LEVEL) IS NOT NULL
        )
        --split From/To pairs
        ,cte_2
    AS (
        SELECT cte_1.TO_FROM_STRING
            ,to_number(substr(cte_1.TO_FROM_STRING, 1, instr(cte_1.TO_FROM_STRING, ':') - 1)) AS FROM_MEAS
            ,to_number(substr(cte_1.TO_FROM_STRING, instr(cte_1.TO_FROM_STRING, ':') + 1, length(cte_1.TO_FROM_STRING) - instr(cte_1.TO_FROM_STRING, ':'))) AS TO_MEAS
        FROM cte_1
        )
        --merge ranges
        ,cte_merge_ranges
    AS (
        SELECT s1.FROM_MEAS
            ,
            --t1.TO_MEAS 
            MIN(t1.TO_MEAS) AS TO_MEAS
        FROM cte_2 s1
        INNER JOIN cte_2 t1 ON s1.FROM_MEAS <= t1.TO_MEAS
            AND NOT EXISTS (
                SELECT *
                FROM cte_2 t2
                WHERE t1.TO_MEAS >= t2.FROM_MEAS
                    AND t1.TO_MEAS < t2.TO_MEAS
                )
        WHERE NOT EXISTS (
                SELECT *
                FROM cte_2 s2
                WHERE s1.FROM_MEAS > s2.FROM_MEAS
                    AND s1.FROM_MEAS <= s2.TO_MEAS
                )
        GROUP BY s1.FROM_MEAS
        )
    SELECT sum(TO_MEAS - FROM_MEAS) AS DISTANCE_COVERED
    INTO intRetNum
    FROM cte_merge_ranges;

    RETURN intRetNum;
END;
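
As a cross-check of what the function is expected to return, here is a small Python sketch with the same input format. It uses a sort-and-sweep merge instead of the self-join in the PL/SQL above, so it is a different technique with the same intent. The name `covered_distance` is hypothetical, and unlike the PL/SQL version (which returns NULL for an empty input) this sketch returns 0.

```python
def covered_distance(range_str):
    """Distance covered by 'from:to' pairs separated by ';'.

    Overlapping or touching pairs are counted only once.
    """
    pairs = []
    for piece in range_str.split(";"):
        if not piece:
            continue  # tolerate a trailing ';' as produced by the caller
        lo, hi = piece.split(":")
        pairs.append((float(lo), float(hi)))

    total = 0.0
    cur_lo = cur_hi = None
    for lo, hi in sorted(pairs):
        if cur_hi is None or lo > cur_hi:
            # disjoint from the running range: close it and start a new one
            if cur_hi is not None:
                total += cur_hi - cur_lo
            cur_lo, cur_hi = lo, hi
        else:
            # overlaps or touches the running range: extend it
            cur_hi = max(cur_hi, hi)
    if cur_hi is not None:
        total += cur_hi - cur_lo
    return total


print(covered_distance("10:20;35:45"))
```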

Then I wrote this query, which builds the string for that function from the appropriate prior ranges. I couldn't use windowing with LISTAGG, but was able to achieve the same with a correlated subquery.

--use list agg to create list of to/from pairs for rows before current row in the ordering
WITH cte_2
AS (
    SELECT T1.*
        ,(
            SELECT LISTAGG(FROM_MEAS || ':' || TO_MEAS || ';') WITHIN
            GROUP (
                    ORDER BY YEAR DESC, EVENT_ID DESC
                    )
            FROM road_events T2
            WHERE T1.YEAR || lpad(T1.EVENT_ID, 10,'0') < 
                T2.YEAR || lpad(T2.EVENT_ID, 10,'0')
                AND T1.ROAD_ID = T2.ROAD_ID
            GROUP BY road_id
            ) AS PRIOR_RANGES_STR
    FROM road_events T1
    )
    --get distance for prior range string - distance ignoring current row
    --get distance including current row
    ,cte_3
AS (
    SELECT cte_2.*
        ,coalesce(get_distance_range_str(PRIOR_RANGES_STR), 0) AS DIST_PRIOR
        ,get_distance_range_str(PRIOR_RANGES_STR || FROM_MEAS || ':' || TO_MEAS || ';') AS DIST_NOW
    FROM cte_2 cte_2
    )
    --distance including current row less distance ignoring current row is the distance added to the range by this row
    ,cte_4
AS (
    SELECT cte_3.*
        ,DIST_NOW - DIST_PRIOR AS DIST_ADDED_THIS_ROW
    FROM cte_3
    )
SELECT *
FROM cte_4
--filter out any rows with distance added as 0
WHERE DIST_ADDED_THIS_ROW > 0
ORDER BY ROAD_ID, YEAR DESC, EVENT_ID DESC
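
The "distance with this row minus distance without it" logic of the query above can be sketched in Python like this. It is a simplified illustration for a single road, under the assumption that rows are ranked by (year, event_id) with the newest first; `coverage` and `added_lengths` are hypothetical names, and the rows are the road 6 test rows from this thread.

```python
# (event_id, year, from_meas, to_meas) for one road: road 6 rows from this thread
EVENTS = [
    (16, 2012, 0, 100),
    (17, 2013, 68, 69),
    (18, 2014, 65, 66),
    (19, 2015, 62, 63),
    (20, 2016, 50, 60),
    (21, 2017, 30, 40),
    (22, 2017, 20, 55),
    (23, 2018, 0, 25),
]


def coverage(ranges):
    """Total length covered by (from, to) ranges, overlaps counted once."""
    total = 0
    cur_lo = cur_hi = None
    for lo, hi in sorted(ranges):
        if cur_hi is None or lo > cur_hi:
            if cur_hi is not None:
                total += cur_hi - cur_lo
            cur_lo, cur_hi = lo, hi
        else:
            cur_hi = max(cur_hi, hi)
    if cur_hi is not None:
        total += cur_hi - cur_lo
    return total


def added_lengths(events):
    """Return (event_id, length added) for rows that add any length."""
    newest_first = sorted(events, key=lambda ev: (ev[1], ev[0]), reverse=True)
    prior = []  # ranges of all newer rows (PRIOR_RANGES_STR in the query)
    out = []
    for event_id, _year, lo, hi in newest_first:
        dist_prior = coverage(prior)       # DIST_PRIOR
        prior.append((lo, hi))
        dist_now = coverage(prior)         # DIST_NOW
        if dist_now - dist_prior > 0:      # DIST_ADDED_THIS_ROW > 0
            out.append((event_id, dist_now - dist_prior))
    return out


for row in added_lengths(EVENTS):
    print(row)
```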

sqlfiddle here: http://sqlfiddle.com/#!4/81331/36

Looks to me like the results match yours. I left the additional columns in the final query to try to illustrate each step.

Works on the test case - might need some work to handle all possibilities in a larger data set, but I think this would be a good place to start and refine.

Credit for Overlapping range merge is first answer here: Merge overlapping date intervals

Credit for list_agg with windowing is first answer here: LISTAGG equivalent with windowing clause

I had a problem with your "road events": because you don't describe what the first meas is, I assume it is the interval from 0 to 1, excluding 1.

So you can compute this with one query:

with newest_MEAS as (
select ROAD_ID, MEAS.m, max(year) y
from road_events
join (select rownum -1 m 
      from dual 
      connect by rownum -1 <= (select max(TOTAL_ROAD_LENGTH) from road_events) ) MEAS
  on MEAS.m between FROM_MEAS and TO_MEAS
group by ROAD_ID, MEAS.m )
select re.event_id, nm.ROAD_ID, re.total_road_length, nm.y, count(nm.m) EVENT_LENGTH
from newest_MEAS nm
join road_events re 
  on nm.ROAD_ID = re.ROAD_ID
  and nm.m between re.from_meas and re.to_meas -1
  and nm.y = re.year
group by re.event_id, nm.ROAD_ID, re.total_road_length, nm.y
order by event_id

SQL Fiddle

Solution:

SELECT RE.road_id, RE.event_id, RE.year, RE.from_meas, RE.to_meas, RE.road_length, RE.event_length, RE.used_length, RE.leftover_length
  FROM
  (
    SELECT RE.C_road_id[road_id], RE.C_event_id[event_id], RE.C_year[year], RE.C_from_meas[from_meas], RE.C_to_meas[to_meas], RE.C_road_length[road_length],
           RE.event_length, RE.used_length, (RE.event_length - (CASE WHEN RE.HasOverlap = 1 THEN RE.used_length ELSE 0 END))[leftover_length]
      FROM
      (
        SELECT RE.C_road_id, RE.C_event_id, RE.C_year, RE.C_from_meas, RE.C_to_meas, RE.C_road_length,
               (CASE WHEN MAX(RE.A_event_id) IS NOT NULL THEN 1 ELSE 0 END)[HasOverlap],
               (RE.C_to_meas - RE.C_from_meas)[event_length],
               SUM(   (CASE WHEN RE.O_to_meas <= RE.C_to_meas THEN RE.O_to_meas ELSE RE.C_to_meas END)
                    - (CASE WHEN RE.O_from_meas >= RE.C_from_meas THEN RE.O_from_meas ELSE RE.C_from_meas END)
                  )[used_length]--This is the length that is already being counted towards later years.
          FROM
          (
            SELECT RE.C_road_id, RE.C_event_id, RE.C_year, RE.C_from_meas, RE.C_to_meas, RE.C_road_length,
                   RE.A_event_id, MIN(RE.O_from_meas)[O_from_meas], MAX(RE.O_to_meas)[O_to_meas]
              FROM
              (
                SELECT RE_C.road_id[C_road_id], RE_C.event_id[C_event_id], RE_C.year[C_year], RE_C.from_meas[C_from_meas], RE_C.to_meas[C_to_meas], RE_C.total_road_length[C_road_length],
                       RE_A.road_id[A_road_id], RE_A.event_id[A_event_id], RE_A.year[A_year], RE_A.from_meas[A_from_meas], RE_A.to_meas[A_to_meas], RE_A.total_road_length[A_road_length],
                       RE_O.road_id[O_road_id], RE_O.event_id[O_event_id], RE_O.year[O_year], RE_O.from_meas[O_from_meas], RE_O.to_meas[O_to_meas], RE_O.total_road_length[O_road_length],
                       (ROW_NUMBER() OVER (PARTITION BY RE_C.road_id, RE_C.event_id, RE_O.event_id ORDER BY RE_S.Overlap DESC, RE_A.event_id))[RowNum]--Use to Group Overlaps into Swaths.
                  FROM road_events as RE_C--Current.
                  LEFT JOIN road_events as RE_A--After.  --Use a Left-Join to capture when there is only 1 Event (or it is the Last-Event in the list).
                    ON RE_A.road_id   = RE_C.road_id
                   AND RE_A.event_id != RE_C.event_id--Not the same EventID.
                   AND RE_A.year     >= RE_C.year--Occurred on or After the Current Event.
                   AND (    (RE_A.from_meas >= RE_C.from_meas AND RE_A.from_meas <= RE_C.to_meas)--There is Overlap.
                         OR (RE_A.to_meas   >= RE_C.from_meas AND RE_A.to_meas   <= RE_C.to_meas)--There is Overlap.
                         OR (RE_A.to_meas    = RE_C.to_meas   AND RE_A.from_meas  = RE_C.from_meas)--They are Equal.
                       )
                  LEFT JOIN road_events as RE_O--Overlapped/Linked.
                    ON RE_O.road_id   = RE_C.road_id
                   AND RE_O.event_id != RE_C.event_id--Not the same EventID.
                   AND RE_O.year     >= RE_C.year--Occurred on or After the Current Event.
                   AND (    (RE_O.from_meas >= RE_A.from_meas AND RE_O.from_meas <= RE_A.to_meas)--There is Overlap.
                         OR (RE_O.to_meas   >= RE_A.from_meas AND RE_O.to_meas   <= RE_A.to_meas)--There is Overlap.
                         OR (RE_O.to_meas    = RE_A.to_meas   AND RE_O.from_meas  = RE_A.from_meas)--They are Equal.
                       )
                  OUTER APPLY
                  (
                    SELECT COUNT(*)[Overlap]
                      FROM road_events as RE_O--Overlapped/Linked.
                     WHERE RE_O.road_id   = RE_C.road_id
                       AND RE_O.event_id != RE_C.event_id--Not the same EventID.
                       AND RE_O.year     >= RE_C.year--Occurred on or After the Current Event.
                       AND (    (RE_O.from_meas >= RE_A.from_meas AND RE_O.from_meas <= RE_A.to_meas)--There is Overlap.
                             OR (RE_O.to_meas   >= RE_A.from_meas AND RE_O.to_meas   <= RE_A.to_meas)--There is Overlap.
                             OR (RE_O.to_meas    = RE_A.to_meas   AND RE_O.from_meas  = RE_A.from_meas)--They are Equal.
                           )
                  ) AS RE_S--Swath of Overlaps.
              ) AS RE
             WHERE RE.RowNum = 1--Remove Duplicates and Select those that are in the biggest Swaths.
             GROUP BY RE.C_road_id, RE.C_event_id, RE.C_year, RE.C_from_meas, RE.C_to_meas, RE.C_road_length,
                      RE.A_event_id
          ) AS RE
         GROUP BY RE.C_road_id, RE.C_event_id, RE.C_year, RE.C_from_meas, RE.C_to_meas, RE.C_road_length
      ) AS RE
  ) AS RE
 WHERE RE.leftover_length > 0--Filter out Events that had their entire Segments overlapped by a Later Event(s).
 ORDER BY RE.road_id, RE.year DESC, RE.event_id

SQL Fiddle:
    http://sqlfiddle.com/#!18/2880b/1

Added Rules/Assumptions/Clarifications:
1.) Allow for the possibility that event_id and road_id could be GUIDs or created out of order,
    so do not script assuming higher or lower values give meaning to the relationship of records.
    For Example:
      An ID of 1 and an ID of 2 does not guarantee that the ID of 2 is the most recent one (and vice versa).
      This is so the solution will be more general and less "hacky".
2.) Filter out Events that had their entire Segments overlapped by a Later Event(s).
    For Example:
      If 2008 had work on 20-50 and 2009 had work on 10-60,
      then the Event for 2008 would be filtered out because its entire Segment was rehashed in 2009.

Additional Test Data:
To ensure solutions are not tailored to only the DataSet given,
    I have added a road_id of 6 to the original DataSet, in order to hit a few more fringe-cases.

INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (16,6,2012,0,100,100);
INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (17,6,2013,68,69,100);
INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (18,6,2014,65,66,100);
INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (19,6,2015,62,63,100);
INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (20,6,2016,50,60,100);
INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (21,6,2017,30,40,100);
INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (22,6,2017,20,55,100);
INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (23,6,2018,0,25,100);

Results: (with the 8 Additional Records I added in Green)

Database Version:
This Solution is Oracle and SQL-Server Agnostic:
    It Should Work in both SS2008+ and Oracle 12c+.

This question is tagged with Oracle 12c, but there is no online-fiddle I may use without signing up,
    so I tested it in SQL-Server - but the same syntax should work in both.
I rely on Cross-Apply and Outer-Apply for most of my queries.
Oracle introduced these "Joins" in 12c:
    https://oracle-base.com/articles/12c/lateral-inline-views-cross-apply-and-outer-apply-joins-12cr1

Simplified and Performant:
This uses:
    • No Correlated Subqueries.
    • No Recursion.
    • No CTE's.
    • No Unions.
    • No User Functions.

Indexes:
I read in one of your comments you had asked about Indexes.
I would add 1-Column Indexes for each of the main Fields you will be searching and grouping on:
    road_id, event_id, and year.
You could see if this index would help you any (I don't know how you plan to use the data):
    Key Fields: road_id, event_id, year
    Include: from_meas, to_meas

Title:
You may want to consider Renaming the Title of this Question to something more searchable like:
    "Aggregate Overlapping Segments to Measure Effective Length".
This would allow the solution to be easier to find for helping others with similar problems.

Other Thoughts:
Something like this would be useful in Tallying up the Overall-Time spent on something
    with overlapping Start and Stop timestamps.

This expands the table to produce a row for each mile of each road and simply takes the MAX year for each mile. We can then just COUNT the number of rows to produce the event_length.

It produces the table exactly as you specified above.

Note: I ran this query against SQL Server. In Oracle, I think you could use LEAST instead of SELECT MIN(event_length) FROM (VALUES...).

WITH NumberRange(result) AS 
(
    SELECT 0
    UNION ALL
    SELECT result + 1
    FROM   NumberRange 
    WHERE  result < 301 --Max length of any road
),
CurrentRoadEventLength(road_id, [year], event_length) AS
(
    SELECT road_id, [year], COUNT(*) AS event_length
    FROM   (
            SELECT re.road_id, n.result, MAX(re.[year]) as [year]
            FROM   road_events re INNER JOIN NumberRange n 
                   ON (    re.from_meas <= n.result 
                       AND re.to_meas > n.result
                      )
            GROUP BY re.road_id, n.result
           ) events_per_mile
    GROUP BY road_id, [year]
)
SELECT re.event_id, re.road_id, re.[year], re.total_road_length, 
       (SELECT MIN(event_length) FROM (VALUES (re.to_meas - re.from_meas), (cre.event_length)) AS EventLengths(event_length))
FROM   road_events re INNER JOIN CurrentRoadEventLength cre
       ON (    re.road_id = cre.road_id
           AND re.[year] = cre.[year]
          )
ORDER BY re.event_id, re.road_id
OPTION (MAXRECURSION 301) --Max length of any road
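
The per-mile expansion described above can also be sketched in a few lines of Python. This is a loose illustration, assuming integer measures: each unit n stands for the stretch from n to n + 1, and the newest event covering it gets the count. Unlike the query, which keys on the year alone, this sketch breaks same-year ties by event_id (my assumption) so that every unit is counted exactly once. The name `lengths_by_unit` is hypothetical, and the rows are the road 6 test rows from this thread.

```python
from collections import Counter

# (event_id, road_id, year, from_meas, to_meas): road 6 rows from this thread
EVENTS = [
    (16, 6, 2012, 0, 100),
    (17, 6, 2013, 68, 69),
    (18, 6, 2014, 65, 66),
    (19, 6, 2015, 62, 63),
    (20, 6, 2016, 50, 60),
    (21, 6, 2017, 30, 40),
    (22, 6, 2017, 20, 55),
    (23, 6, 2018, 0, 25),
]


def lengths_by_unit(events):
    """Count, per event, the units on which it is the newest event."""
    newest = {}  # (road_id, unit) -> (year, event_id) of the newest event
    for event_id, road_id, year, lo, hi in events:
        for unit in range(lo, hi):  # from_meas <= unit < to_meas
            key = (road_id, unit)
            candidate = (year, event_id)
            newest[key] = max(newest.get(key, candidate), candidate)
    # COUNT the surviving rows per event
    return dict(Counter(event_id for _year, event_id in newest.values()))


print(lengths_by_unit(EVENTS))
```

The memory use grows with the total road length, so this style suits short integer ranges better than long or fractional ones.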