Merging intervals in one pass in SQL

前端 未结 4 733
轻奢々
轻奢々 2020-12-06 13:02

Let\'s say I have a table with two columns: start and end, both integers, and the table is ordered by the first, then second column. Each row repr

相关标签:
4条回答
  • 2020-12-06 13:40

    In your links you have omitted one: Can I use a SQL Server CTE to merge intersecting dates? where I present a RECURSIVE CTE solution to the overlapping intervals problem. Recursive CTE's can be handled differently (compared to ordinary self-joins), and often perform amazingly fast.

    mysql does not have recursive CTEs. Postgres has them, Oracle has them, Microsoft has them.

    Here Querying for a 'run' of consecutive columns in Postgres is another one, with a fudge-factor.

    Here Get total time interval from multiple rows if sequence not broken is yet another one.

    0 讨论(0)
  • 2020-12-06 13:41

    For now, the best answer I've found is: use indexing. This brings the complexity down from quadratic to O(n log n).

    With a covering index, the queries turned out to be fast enough for my needs; with just an index on either the start or end column, it was slower but still OK. In each case, EXPLAIN QUERY PLAN told me that a single table scan is combined with use of the index, as expected.

    Finding an element in the index isn't quite O(1), but turned out to be close enough. And building the index isn't slow, either.

    What remains is the proof that a true O(n) algorithm can't be written in SQL.

    So another answer is to write it in a different language and then apply it to a SQLite table. There are various ways to make that work:

    • export the table to a CSV file; read the CSV file, apply the algorithm, produce CSV; import the resulting CSV file as a table;
    • use a SQLite driver for that language (e.g. DBD::SQLite for Perl, RSQLite for R)
    • write a SQLite extension function that somehow interfaces with the language of choice
    0 讨论(0)
  • 2020-12-06 13:53

    Based on the answer to my question in the comments, I don't think my idea would have worked. since you mentioned you it can (and I assume you know how) be done with joins, I had an idea of minimizing the number of rows to be joined by keeping only ranges that belong to distinct points like the following:

    select start, max(end) as end
    from (
          select min(start) as start,end
          from table
          group by end
         ) in_tab
    group by in_tab.start
    

    the above inner select makes sure that no end point repeats and selects the longest start point for each end. the outer select does just the opposite. we end up with ranges that start and end at different points (with any FULLY contained/overlapped range removed). This might have worked if the max range was not big. if these were dates and there is maximum a year difference between lowest date in the whole table and highest date in it, then it would have been 365*364 options to pick any two points and that would have been the higher limit to the possible rows after the above select. these then could have been used in a temp table using the join method that you already have. but with the numbers you mentioned, then theoretically we have a huge number that makes this try irrelevant. even though the above minimizes the rows to be used in the calculation, they would still be too much to use in a join.

    I do not know of a way to make this in ANSI SQL without joins when there is no other non standard functionality provided by the RDBMS. for example in oracle this can easily be achieved with analytical functions. the best would be in this case to use the above to minimize the number of rows used and bring them to your application and there you can write code that calculates the ranges and insert them back in the database.

    0 讨论(0)
  • 2020-12-06 13:55

    Well, here is a solution that works in MySQL (I don't know if it will work in SQlite). I think, but cannot prove, that is O(n) (discarding the time it takes to sort the events table initially, i.e. if it is already sorted as I think the question states.)

    > SELECT * from events;
    +-------+-----+
    | start | end |
    +-------+-----+
    |     1 |   9 |
    |     5 |   8 |
    |     8 |  11 |
    |    11 |  13 |
    |    17 |  25 |
    |    18 |  26 |
    |    33 |  42 |
    |    59 |  81 |
    |    61 |  87 |
    |    97 | 132 |
    |   105 | 191 |
    |   107 | 240 |
    |   198 | 213 |
    |   202 | 215 |
    +-------+-----+
    14 rows in set (0.00 sec)
    
    
    SET @interval_id = 0;
    SET @interval_end = 0;
    
    SELECT
      MIN(start) AS start,
      MAX(end) AS end
      FROM
        (SELECT
           @interval_id := IF(start > @interval_end,
                              @interval_id + 1,
                              @interval_id) AS interval_id,
           @interval_end := IF(start < @interval_end,
                               GREATEST(@interval_end, end),
                               end) AS interval_end,
           events.*
         FROM events
         ORDER BY start,end) tmp
      GROUP BY interval_id;
    
    +-------+------+
    | start | end  |
    +-------+------+
    |     1 |   13 |
    |    17 |   26 |
    |    33 |   42 |
    |    59 |   87 |
    |    97 |  240 |
    +-------+------+
    5 rows in set (0.00 sec)
    
    0 讨论(0)
提交回复
热议问题