Aggregate Overlapping Segments to Measure Effective Length

我寻月下人不归 2021-02-07 02:08

I have a road_events table:

create table road_events (
    event_id number(4,0),
    road_id number(4,0),
    year number(4,0),
    from_meas number         


        
6 Answers
  • 2021-02-07 02:22

    This expands the table to produce a row for each mile of each road and takes the MAX year for each mile. We can then COUNT the rows per road and year to produce the event_length.

    It produces the table exactly as you specified above.

    Note: I ran this query against SQL Server. You could use LEAST instead of SELECT MIN(event_length) FROM (VALUES...) in Oracle I think.

    WITH NumberRange(result) AS 
    (
        SELECT 0
        UNION ALL
        SELECT result + 1
        FROM   NumberRange 
        WHERE  result < 301 --Max length of any road
    ),
    CurrentRoadEventLength(road_id, [year], event_length) AS
    (
        SELECT road_id, [year], COUNT(*) AS event_length
        FROM   (
                SELECT re.road_id, n.result, MAX(re.[year]) as [year]
                FROM   road_events re INNER JOIN NumberRange n 
                       ON (    re.from_meas <= n.result 
                           AND re.to_meas > n.result
                          )
                GROUP BY re.road_id, n.result
               ) events_per_mile
        GROUP BY road_id, [year]
    )
    SELECT re.event_id, re.road_id, re.[year], re.total_road_length, 
           (SELECT MIN(event_length) FROM (VALUES (re.to_meas - re.from_meas), (cre.event_length)) AS EventLengths(event_length))
    FROM   road_events re INNER JOIN CurrentRoadEventLength cre
           ON (    re.road_id = cre.road_id
               AND re.[year] = cre.[year]
              )
    ORDER BY re.event_id, re.road_id
    OPTION (MAXRECURSION 301) --Max length of any road
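
    The per-mile idea above can also be sketched outside the database. The following Python is only an illustration, assuming integer measures and half-open segments [from_meas, to_meas). The Event tuple, the function name and the sample rows are made up for the example, and ties within the same year are broken by the higher event_id, which the SQL above does not do.

```python
from collections import Counter, namedtuple

# Hypothetical row shape mirroring the road_events columns used above.
Event = namedtuple("Event", "event_id road_id year from_meas to_meas")


def effective_length_per_unit(events):
    """Return {event_id: number of unit slots where that event is the newest}."""
    newest = {}  # (road_id, unit) -> (year, event_id) of the newest covering event
    for e in events:
        # One slot per unit of length, like the join against NumberRange.
        for unit in range(e.from_meas, e.to_meas):
            key = (e.road_id, unit)
            candidate = (e.year, e.event_id)
            if key not in newest or candidate > newest[key]:
                newest[key] = candidate
    # Count the slots each event won, like the outer COUNT(*).
    return dict(Counter(event_id for _, event_id in newest.values()))


# Made-up sample: a 2015 event sits in the middle of a 2010 event.
sample = [
    Event(1, 1, 2010, 0, 10),
    Event(2, 1, 2015, 5, 8),
]
print(effective_length_per_unit(sample))
```

    For this sample, event 1 keeps 7 units and event 2 keeps 3. Like the SQL, the cost grows with road length, so it suits short roads with whole-number measures.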
    
  • 2021-02-07 02:23

    My main DBMS is Teradata, but this will work as-is in Oracle, too.

    WITH all_meas AS
     ( -- get a distinct list of all from/to points
       SELECT road_id, from_meas AS meas
       FROM road_events
       UNION
       SELECT road_id, to_meas
       FROM road_events
     )
    -- select * from all_meas order by 1,2
     , all_ranges AS
     ( -- create from/to ranges
       SELECT road_id, meas AS from_meas 
         ,Lead(meas)
          Over (PARTITION BY road_id
                ORDER BY meas) AS to_meas
       FROM all_meas
      )
     -- SELECT * from all_ranges order by 1,2
    , all_event_ranges AS
     ( -- now match the ranges to the event ranges
       SELECT 
          ar.*
         ,re.event_id
         ,re.year
         ,re.total_road_length
         ,ar.to_meas - ar.from_meas AS event_length
         -- used to filter the latest event as multiple events might cover the same range 
         ,Row_Number()
          Over (PARTITION BY ar.road_id, ar.from_meas
                ORDER BY year DESC) AS rn
       FROM all_ranges ar
       JOIN road_events re
         ON ar.road_id = re.road_id
        AND ar.from_meas < re.to_meas
        AND ar.to_meas > re.from_meas
       WHERE ar.to_meas IS NOT NULL
     )
    SELECT event_id, road_id, year, total_road_length, Sum(event_length)
    FROM all_event_ranges
    WHERE rn = 1 -- latest year only
    GROUP BY event_id, road_id, year, total_road_length
    ORDER BY road_id, year DESC;
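
    The same three steps (collect the distinct points, pair consecutive points into ranges, give each range to the latest event) can be sketched in Python. This is a minimal illustration, not a translation of the query: the Event tuple, the function name and the sample rows are assumptions made for the example, and when two events of the same year cover a range, the first one in the input wins.

```python
from collections import namedtuple

# Hypothetical row shape mirroring the road_events columns used above.
Event = namedtuple("Event", "event_id road_id year from_meas to_meas")


def effective_length_by_ranges(events):
    """Return {event_id: total length of the ranges where it is the latest event}."""
    by_road = {}
    for e in events:
        by_road.setdefault(e.road_id, []).append(e)

    totals = {}
    for road_events in by_road.values():
        # Distinct from/to points of the road, like all_meas.
        points = sorted({p for e in road_events for p in (e.from_meas, e.to_meas)})
        # Consecutive points form elementary ranges, like all_ranges (Lead).
        for lo, hi in zip(points, points[1:]):
            covering = [e for e in road_events if lo < e.to_meas and hi > e.from_meas]
            if not covering:
                continue  # a gap that no event covers
            winner = max(covering, key=lambda e: e.year)  # like rn = 1
            totals[winner.event_id] = totals.get(winner.event_id, 0) + (hi - lo)
    return totals


# Made-up sample with fractional measures.
sample = [
    Event(1, 1, 2010, 0.0, 10.0),
    Event(2, 1, 2015, 2.5, 4.0),
    Event(3, 2, 2012, 0.0, 3.0),
]
print(effective_length_by_ranges(sample))
```

    Because it works on boundary points rather than on every unit of length, this version does not depend on integer measures or on the length of the road.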
    

    If you need to return the actual covered from/to_meas (as in your question before edit), it might be more complicated. The first part is the same, but without aggregation the query can return adjacent rows with the same event_id (e.g. for event 3: 0-1 & 1-25):

    SELECT * FROM all_event_ranges
    WHERE rn = 1
    ORDER BY road_id, from_meas;
    

    If you want to merge adjacent rows you need two more steps (using a standard approach, flag the 1st row of a group and calculate a group number):

    WITH all_meas AS
     (
       SELECT road_id, from_meas AS meas
       FROM road_events
       UNION
       SELECT road_id, to_meas
       FROM road_events
     )
    -- select * from all_meas order by 1,2
     , all_ranges AS
     ( 
       SELECT road_id, meas AS from_meas 
         ,Lead(meas)
          Over (PARTITION BY road_id
                ORDER BY meas) AS to_meas
       FROM all_meas
      )
    -- SELECT * from all_ranges order by 1,2
    , all_event_ranges AS
     (
       SELECT 
          ar.*
         ,re.event_id
         ,re.year
         ,re.total_road_length
         ,ar.to_meas - ar.from_meas AS event_length
         ,Row_Number()
          Over (PARTITION BY ar.road_id, ar.from_meas
                ORDER BY year DESC) AS rn
       FROM all_ranges ar
       JOIN road_events  re
         ON ar.road_id = re.road_id
        AND ar.from_meas < re.to_meas
        AND ar.to_meas > re.from_meas
       WHERE ar.to_meas IS NOT NULL
     )
    -- SELECT * FROM all_event_ranges WHERE rn = 1 ORDER BY road_id, from_meas
    , adjacent_events AS 
     ( -- assign 1 to the 1st row of an event
       SELECT t.*
         ,CASE WHEN Lag(event_id)
                    Over(PARTITION BY road_id
                         ORDER BY from_meas) = event_id
               THEN 0 
               ELSE 1 
          END AS flag
       FROM all_event_ranges t
       WHERE rn = 1
     )
    -- SELECT * FROM adjacent_events ORDER BY road_id, from_meas 
    , grouped_events AS
     ( -- assign a groupnumber to adjacent rows using a Cumulative Sum over 0/1
       SELECT t.*
         ,Sum(flag)
          Over (PARTITION BY road_id
                ORDER BY from_meas
                ROWS Unbounded Preceding) AS grp
       FROM adjacent_events t
    )
    -- SELECT * FROM grouped_events ORDER BY  road_id, from_meas
    SELECT event_id, road_id, year, Min(from_meas), Max(to_meas), total_road_length, Sum(event_length)
    FROM grouped_events
    GROUP BY event_id, road_id, grp, year, total_road_length
    ORDER BY 2, Min(from_meas);
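
    The flag and running-sum trick can be hard to see in SQL, so here is a small Python sketch of just that step. It assumes the input already contains one winning event per range (the rn = 1 rows). The Piece tuple, the function name and the sample rows are invented for the illustration.

```python
from collections import namedtuple
from itertools import accumulate, groupby

# Hypothetical shape of one "rn = 1" row: a range and the event that owns it.
Piece = namedtuple("Piece", "road_id from_meas to_meas event_id")


def merge_adjacent(pieces):
    """Merge consecutive pieces of the same road that belong to the same event."""
    ordered = sorted(pieces, key=lambda p: (p.road_id, p.from_meas))

    # Flag 1 on the first row of each run, 0 when the previous row has the same event.
    flags = []
    prev = None
    for p in ordered:
        continues = (
            prev is not None
            and prev.road_id == p.road_id
            and prev.event_id == p.event_id
        )
        flags.append(0 if continues else 1)
        prev = p

    # The running sum of the flags is the group number. A change of road always
    # sets the flag, so one global sum groups the same rows as a per-road sum.
    group_numbers = accumulate(flags)

    merged = []
    for _, rows in groupby(zip(group_numbers, ordered), key=lambda pair: pair[0]):
        members = [piece for _, piece in rows]
        first, last = members[0], members[-1]
        merged.append(Piece(first.road_id, first.from_meas, last.to_meas, first.event_id))
    return merged


# Event 3 is split into 0-1 and 1-25, as in the example mentioned above.
sample = [
    Piece(1, 0, 1, 3),
    Piece(1, 1, 25, 3),
    Piece(1, 25, 50, 5),
    Piece(2, 0, 10, 7),
]
for piece in merge_adjacent(sample):
    print(piece)
```

    Like the query, the sketch only compares event ids of neighbouring rows and does not check that the two ranges actually touch.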
    

    Edit:

    Oops, I just found a blog post, Overlapping ranges with priority, doing exactly the same with some simplified Oracle syntax. In fact, I translated my query from some other simplified syntax in Teradata to Standard/Oracle SQL :-)

  • 2021-02-07 02:26

    There is another way to calculate this, with from and to values:

    with 
      part_begin_point as (
        Select distinct road_id, from_meas point
        from road_events be
        union 
        Select distinct road_id, to_meas point
        from road_events ee
      )
    , newest_part as (
      select e.event_id
      , e.road_id
      , e.year
      , e.total_road_length
      , p.point
      , LAG(e.event_id) over (partition by p.road_id order by p.point) prev_event
      , e.to_meas event_to_meas
      from part_begin_point p
      join road_events e
       on p.road_id = e.road_id
       and p.point >= e.from_meas and  p.point < e.to_meas
       and not exists(
            select 1 from road_events ne 
            where e.road_id = ne.road_id
            and p.point >= ne.from_meas and p.point < ne.to_meas
            and (e.year < ne.year or e.year = ne.year and e.event_id < ne.event_id))
      )
    select event_id, road_id, year
    , point from_meas
    , LEAD(point, 1, event_to_meas) over (partition by road_id order by point) to_meas
    , total_road_length
    , LEAD(point, 1, event_to_meas) over (partition by road_id order by point) - point EVENT_LENGTH
    from newest_part
    where (event_id <> prev_event or prev_event is null)
    order by event_id, point
    

    SQL Fiddle

  • 2021-02-07 02:27

    I had a problem with your "road events" because you don't describe what the 1st meas is; I assume it is the interval from 0 to 1, not including 1.

    So, you can count this with one query:

    with newest_MEAS as (
    select ROAD_ID, MEAS.m, max(year) y
    from road_events
    join (select rownum -1 m 
          from dual 
          connect by rownum -1 <= (select max(TOTAL_ROAD_LENGTH) from road_events) ) MEAS
      on MEAS.m between FROM_MEAS and TO_MEAS - 1
    group by ROAD_ID, MEAS.m )
    select re.event_id, nm.ROAD_ID, re.total_road_length, nm.y, count(nm.m) EVENT_LENGTH
    from newest_MEAS nm
    join road_events re 
      on nm.ROAD_ID = re.ROAD_ID
      and nm.m between re.from_meas and re.to_meas -1
      and nm.y = re.year
    group by re.event_id, nm.ROAD_ID, re.total_road_length, nm.y
    order by event_id
    

    SQL Fiddle

  • 2021-02-07 02:28

    Solution:

    SELECT RE.road_id, RE.event_id, RE.year, RE.from_meas, RE.to_meas, RE.road_length, RE.event_length, RE.used_length, RE.leftover_length
      FROM
      (
        SELECT RE.C_road_id[road_id], RE.C_event_id[event_id], RE.C_year[year], RE.C_from_meas[from_meas], RE.C_to_meas[to_meas], RE.C_road_length[road_length],
               RE.event_length, RE.used_length, (RE.event_length - (CASE WHEN RE.HasOverlap = 1 THEN RE.used_length ELSE 0 END))[leftover_length]
          FROM
          (
            SELECT RE.C_road_id, RE.C_event_id, RE.C_year, RE.C_from_meas, RE.C_to_meas, RE.C_road_length,
                   (CASE WHEN MAX(RE.A_event_id) IS NOT NULL THEN 1 ELSE 0 END)[HasOverlap],
                   (RE.C_to_meas - RE.C_from_meas)[event_length],
                   SUM(   (CASE WHEN RE.O_to_meas <= RE.C_to_meas THEN RE.O_to_meas ELSE RE.C_to_meas END)
                        - (CASE WHEN RE.O_from_meas >= RE.C_from_meas THEN RE.O_from_meas ELSE RE.C_from_meas END)
                      )[used_length]--This is the length that is already being counted towards later years.
              FROM
              (
                SELECT RE.C_road_id, RE.C_event_id, RE.C_year, RE.C_from_meas, RE.C_to_meas, RE.C_road_length,
                       RE.A_event_id, MIN(RE.O_from_meas)[O_from_meas], MAX(RE.O_to_meas)[O_to_meas]
                  FROM
                  (
                    SELECT RE_C.road_id[C_road_id], RE_C.event_id[C_event_id], RE_C.year[C_year], RE_C.from_meas[C_from_meas], RE_C.to_meas[C_to_meas], RE_C.total_road_length[C_road_length],
                           RE_A.road_id[A_road_id], RE_A.event_id[A_event_id], RE_A.year[A_year], RE_A.from_meas[A_from_meas], RE_A.to_meas[A_to_meas], RE_A.total_road_length[A_road_length],
                           RE_O.road_id[O_road_id], RE_O.event_id[O_event_id], RE_O.year[O_year], RE_O.from_meas[O_from_meas], RE_O.to_meas[O_to_meas], RE_O.total_road_length[O_road_length],
                           (ROW_NUMBER() OVER (PARTITION BY RE_C.road_id, RE_C.event_id, RE_O.event_id ORDER BY RE_S.Overlap DESC, RE_A.event_id))[RowNum]--Use to Group Overlaps into Swaths.
                      FROM road_events as RE_C--Current.
                      LEFT JOIN road_events as RE_A--After.  --Use a Left-Join to capture when there is only 1 Event (or it is the Last-Event in the list).
                        ON RE_A.road_id   = RE_C.road_id
                       AND RE_A.event_id != RE_C.event_id--Not the same EventID.
                       AND RE_A.year     >= RE_C.year--Occured on or After the Current Event.
                       AND (    (RE_A.from_meas >= RE_C.from_meas AND RE_A.from_meas <= RE_C.to_meas)--There is Overlap.
                             OR (RE_A.to_meas   >= RE_C.from_meas AND RE_A.to_meas   <= RE_C.to_meas)--There is Overlap.
                             OR (RE_A.to_meas    = RE_C.to_meas   AND RE_A.from_meas  = RE_C.from_meas)--They are Equal.
                           )
                      LEFT JOIN road_events as RE_O--Overlapped/Linked.
                        ON RE_O.road_id   = RE_C.road_id
                       AND RE_O.event_id != RE_C.event_id--Not the same EventID.
                       AND RE_O.year     >= RE_C.year--Occured on or After the Current Event.
                       AND (    (RE_O.from_meas >= RE_A.from_meas AND RE_O.from_meas <= RE_A.to_meas)--There is Overlap.
                             OR (RE_O.to_meas   >= RE_A.from_meas AND RE_O.to_meas   <= RE_A.to_meas)--There is Overlap.
                             OR (RE_O.to_meas    = RE_A.to_meas   AND RE_O.from_meas  = RE_A.from_meas)--They are Equal.
                           )
                      OUTER APPLY
                      (
                        SELECT COUNT(*)[Overlap]
                          FROM road_events as RE_O--Overlapped/Linked.
                         WHERE RE_O.road_id   = RE_C.road_id
                           AND RE_O.event_id != RE_C.event_id--Not the same EventID.
                           AND RE_O.year     >= RE_C.year--Occured on or After the Current Event.
                           AND (    (RE_O.from_meas >= RE_A.from_meas AND RE_O.from_meas <= RE_A.to_meas)--There is Overlap.
                                 OR (RE_O.to_meas   >= RE_A.from_meas AND RE_O.to_meas   <= RE_A.to_meas)--There is Overlap.
                                 OR (RE_O.to_meas    = RE_A.to_meas   AND RE_O.from_meas  = RE_A.from_meas)--They are Equal.
                               )
                      ) AS RE_S--Swath of Overlaps.
                  ) AS RE
                 WHERE RE.RowNum = 1--Remove Duplicates and Select those that are in the biggest Swaths.
                 GROUP BY RE.C_road_id, RE.C_event_id, RE.C_year, RE.C_from_meas, RE.C_to_meas, RE.C_road_length,
                          RE.A_event_id
              ) AS RE
             GROUP BY RE.C_road_id, RE.C_event_id, RE.C_year, RE.C_from_meas, RE.C_to_meas, RE.C_road_length
          ) AS RE
      ) AS RE
     WHERE RE.leftover_length > 0--Filter out Events that had their entire Segments overlapped by a Later Event(s).
     ORDER BY RE.road_id, RE.year DESC, RE.event_id
    

    SQL Fiddle:
        http://sqlfiddle.com/#!18/2880b/1

    Added Rules/Assumptions/Clarifications:
    1.) Allow for the possibility that event_id and road_id could be GUIDs or created out of order,
        so do not script assuming higher or lower values give meaning to the relationship of records.
        For Example:
          An ID of 1 and an ID of 2 does not guarantee the ID of 2 is the most recent one (and vice versa).
          This is so the solution will be more general and less "hacky".
    2.) Filter out Events that had their entire Segments overlapped by a Later Event(s).
        For Example:
          If 2008 had work on 20-50 and 2009 had work on 10-60,
          then the Event for 2008 would be filtered out because its entire Segment was rehashed in 2009.

    Additional Test Data:
    To ensure solutions are not tailored to only the DataSet given,
        I have added a road_id of 6 to the original DataSet, in order to hit a few more fringe-cases.

    INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (16,6,2012,0,100,100);
    INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (17,6,2013,68,69,100);
    INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (18,6,2014,65,66,100);
    INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (19,6,2015,62,63,100);
    INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (20,6,2016,50,60,100);
    INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (21,6,2017,30,40,100);
    INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (22,6,2017,20,55,100);
    INSERT INTO road_events (event_id, road_id, year, from_meas, to_meas, total_road_length) VALUES (23,6,2018,0,25,100);
    

    Results: (with the 8 Additional Records I added in Green)

    Database Version:
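    This Solution is intended to be Oracle and SQL-Server Agnostic:
        It Should Work in both SS2008+ and Oracle 12c+.

    This question is tagged with Oracle 12c, but there is no online-fiddle I may use without signing up,
        so I tested it in SQL-Server only. As posted, the query uses SQL-Server syntax
        (bracketed column aliases such as [road_id], and "as" before table aliases),
        which would have to be rewritten before it runs in Oracle.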
    I rely on Cross-Apply and Outer-Apply for most of my queries.
    Oracle introduced these "Joins" in 12c:
        https://oracle-base.com/articles/12c/lateral-inline-views-cross-apply-and-outer-apply-joins-12cr1

    Simplified and Performant:
    This uses:
        • No Correlated Subqueries.
        • No Recursion.
        • No CTE's.
        • No Unions.
        • No User Functions.

    Indexes:
    I read in one of your comments you had asked about Indexes.
    I would add 1-Column Indexes for each of the main Fields you will be searching and grouping on:
        road_id, event_id, and year.
    You could see if this index would help you any (I don't know how you plan to use the data):
        Key Fields: road_id, event_id, year
        Include: from_meas, to_meas

    Title:
    You may want to consider Renaming the Title of this Question to something more searchable like:
        "Aggregate Overlapping Segments to Measure Effective Length".
    This would allow the solution to be easier to find for helping others with similar problems.

    Other Thoughts:
    Something like this would be useful in Tallying up the Overall-Time spent on something
        with overlapping Start and Stop timestamps.

  • 2021-02-07 02:29

    Thought about this too much today, but I have something that ignores the +/- 10 meters now.

    First made a function that takes in to / from pairs as a string and returns the distance covered by the pairs in the string. For example '10:20;35:45' returns 20.

    CREATE
        OR replace FUNCTION get_distance_range_str (strRangeStr VARCHAR2)
    
    RETURN NUMBER IS intRetNum NUMBER;
    
    BEGIN
        --split input string
        WITH cte_1
        AS (
            SELECT regexp_substr(strRangeStr, '[^;]+', 1, LEVEL) AS TO_FROM_STRING
            FROM dual connect BY regexp_substr(strRangeStr, '[^;]+', 1, LEVEL) IS NOT NULL
            )
            --split From/To pairs
            ,cte_2
        AS (
            SELECT cte_1.TO_FROM_STRING
                ,to_number(substr(cte_1.TO_FROM_STRING, 1, instr(cte_1.TO_FROM_STRING, ':') - 1)) AS FROM_MEAS
                ,to_number(substr(cte_1.TO_FROM_STRING, instr(cte_1.TO_FROM_STRING, ':') + 1, length(cte_1.TO_FROM_STRING) - instr(cte_1.TO_FROM_STRING, ':'))) AS TO_MEAS
            FROM cte_1
            )
            --merge ranges
            ,cte_merge_ranges
        AS (
            SELECT s1.FROM_MEAS
                ,
                --t1.TO_MEAS 
                MIN(t1.TO_MEAS) AS TO_MEAS
            FROM cte_2 s1
            INNER JOIN cte_2 t1 ON s1.FROM_MEAS <= t1.TO_MEAS
                AND NOT EXISTS (
                    SELECT *
                    FROM cte_2 t2
                    WHERE t1.TO_MEAS >= t2.FROM_MEAS
                        AND t1.TO_MEAS < t2.TO_MEAS
                    )
            WHERE NOT EXISTS (
                    SELECT *
                    FROM cte_2 s2
                    WHERE s1.FROM_MEAS > s2.FROM_MEAS
                        AND s1.FROM_MEAS <= s2.TO_MEAS
                    )
            GROUP BY s1.FROM_MEAS
            )
        SELECT sum(TO_MEAS - FROM_MEAS) AS DISTANCE_COVERED
        INTO intRetNum
        FROM cte_merge_ranges;
    
        RETURN intRetNum;
    END;
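
    For readers who want to check the function's behaviour without Oracle, here is a rough Python equivalent. It is a sketch under assumptions: it uses a sort-and-sweep merge instead of the self-join merge in the PL/SQL above, the name distance_covered is made up, and it returns 0 for an empty string, where the PL/SQL relies on coalesce in the calling query.

```python
def distance_covered(range_str):
    """Total length covered by 'from:to' pairs separated by ';', counting overlaps once."""
    ranges = []
    for pair in range_str.split(";"):
        if not pair:
            continue  # tolerate a trailing ';' or an empty string
        start, end = pair.split(":")
        ranges.append((float(start), float(end)))

    ranges.sort()
    total = 0.0
    current_start = current_end = None
    for start, end in ranges:
        if current_end is None or start > current_end:
            # The open range is finished: add it and start a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching: extend the open range.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


print(distance_covered("10:20;35:45"))
print(distance_covered("0:10;5:15;"))
```

    The first call reproduces the example above (20); the second shows two overlapping pairs counted once, for a total of 15.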
    

    Then wrote this query that builds a string for that function for the appropriate prior range. Couldn't use windowing with list_agg, but was able to achieve same with a correlated subquery.

    --use list agg to create list of to/from pairs for rows before current row in the ordering
    WITH cte_2
    AS (
        SELECT T1.*
            ,(
                SELECT LISTAGG(FROM_MEAS || ':' || TO_MEAS || ';') WITHIN
                GROUP (
                        ORDER BY YEAR DESC, EVENT_ID DESC
                        )
                FROM road_events T2
                WHERE T1.YEAR || lpad(T1.EVENT_ID, 10,'0') < 
                    T2.YEAR || lpad(T2.EVENT_ID, 10,'0')
                    AND T1.ROAD_ID = T2.ROAD_ID
                GROUP BY road_id
                ) AS PRIOR_RANGES_STR
        FROM road_events T1
        )
        --get distance for prior range string - distance ignoring current row
        --get distance including current row
        ,cte_3
    AS (
        SELECT cte_2.*
            ,coalesce(get_distance_range_str(PRIOR_RANGES_STR), 0) AS DIST_PRIOR
            ,get_distance_range_str(PRIOR_RANGES_STR || FROM_MEAS || ':' || TO_MEAS || ';') AS DIST_NOW
        FROM cte_2 cte_2
        )
        --distance including current row less distance ignoring current row is distance added to the range this row
        ,cte_4
    AS (
        SELECT cte_3.*
            ,DIST_NOW - DIST_PRIOR AS DIST_ADDED_THIS_ROW
        FROM cte_3
        )
    SELECT *
    FROM cte_4
    --filter out any rows with distance added as 0
    WHERE DIST_ADDED_THIS_ROW > 0
    ORDER BY ROAD_ID, YEAR DESC, EVENT_ID DESC
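
    The "distance now minus distance prior" step can be sketched in Python as follows. This is an illustration only: the Event tuple, the helper names and the sample rows are assumptions, and the ranges are kept in a list instead of being packed into a string.

```python
from collections import namedtuple

# Hypothetical row shape mirroring the road_events columns used above.
Event = namedtuple("Event", "event_id road_id year from_meas to_meas")


def union_length(ranges):
    """Length covered by a list of (start, end) ranges, counting overlaps once."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(ranges):
        if current_end is None or start > current_end:
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def added_lengths(events):
    """Return {event_id: length the event adds beyond all newer events on its road}."""
    by_road = {}
    for e in events:
        by_road.setdefault(e.road_id, []).append(e)

    result = {}
    for road_events in by_road.values():
        # Newest first, ties broken by event_id, like the ORDER BY in the query.
        ordered = sorted(road_events, key=lambda e: (e.year, e.event_id), reverse=True)
        prior = []
        for e in ordered:
            dist_prior = union_length(prior)  # coverage ignoring the current row
            prior.append((e.from_meas, e.to_meas))
            dist_now = union_length(prior)    # coverage including the current row
            if dist_now - dist_prior > 0:     # drop rows that add nothing
                result[e.event_id] = dist_now - dist_prior
    return result


# Made-up sample: the 2008 event lies entirely inside the 2009 event.
sample = [
    Event(1, 1, 2008, 20, 50),
    Event(2, 1, 2009, 10, 60),
    Event(3, 1, 2007, 0, 15),
]
print(added_lengths(sample))
```

    Here the 2009 event adds 50, the 2008 event adds nothing and is dropped, and the 2007 event adds only the 10 units from 0 to 10.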
    

    sqlfiddle here: http://sqlfiddle.com/#!4/81331/36

    Looks to me like the results match yours. I left the additional columns in the final query to try to illustrate each step.

    Works on the test case - might need some work to handle all possibilities in a larger data set, but I think this would be a good place to start and refine.

    Credit for Overlapping range merge is first answer here: Merge overlapping date intervals

    Credit for list_agg with windowing is first answer here: LISTAGG equivalent with windowing clause
