Let\'s say I have a table with two columns: start
and end
, both integers, and the table is ordered by the first, then second column. Each row repr
In your links you have omitted one: Can I use a SQL Server CTE to merge intersecting dates? where I present a RECURSIVE CTE solution to the overlapping intervals problem. Recursive CTE's can be handled differently (compared to ordinary self-joins), and often perform amazingly fast.
mysql does not have recursive CTEs. Postgres has them, Oracle has them, Microsoft has them.
Here Querying for a 'run' of consecutive columns in Postgres is another one, with a fudge-factor.
Here Get total time interval from multiple rows if sequence not broken is yet another one.
For now, the best answer I've found is: use indexing. This brings the complexity down from quadratic to O(n log n).
With a covering index, the queries turned out to be fast enough for my needs;
with just an index on either the start or end column, it was slower but still OK.
In each case, EXPLAIN QUERY PLAN
told me that a single table scan is combined with use of the index, as expected.
Finding an element in the index isn't quite O(1), but turned out to be close enough. And building the index isn't slow, either.
What remains is the proof that a true O(n) algorithm can't be written in SQL.
So another answer is to write it in a different language and then apply it to a SQLite table. There are various ways to make that work:
Based on the answer to my question in the comments, I don't think my idea would have worked. since you mentioned you it can (and I assume you know how) be done with joins, I had an idea of minimizing the number of rows to be joined by keeping only ranges that belong to distinct points like the following:
select start, max(end) as end
from (
select min(start) as start,end
from table
group by end
) in_tab
group by in_tab.start
the above inner select makes sure that no end point repeats and selects the longest start point for each end. the outer select does just the opposite. we end up with ranges that start and end at different points (with any FULLY contained/overlapped range removed). This might have worked if the max range was not big. if these were dates and there is maximum a year difference between lowest date in the whole table and highest date in it, then it would have been 365*364 options to pick any two points and that would have been the higher limit to the possible rows after the above select. these then could have been used in a temp table using the join method that you already have. but with the numbers you mentioned, then theoretically we have a huge number that makes this try irrelevant. even though the above minimizes the rows to be used in the calculation, they would still be too much to use in a join.
I do not know of a way to make this in ANSI SQL without joins when there is no other non standard functionality provided by the RDBMS. for example in oracle this can easily be achieved with analytical functions. the best would be in this case to use the above to minimize the number of rows used and bring them to your application and there you can write code that calculates the ranges and insert them back in the database.
Well, here is a solution that works in MySQL (I don't know if it will work in SQlite). I think, but cannot prove, that is O(n) (discarding the time it takes to sort the events table initially, i.e. if it is already sorted as I think the question states.)
> SELECT * from events;
+-------+-----+
| start | end |
+-------+-----+
| 1 | 9 |
| 5 | 8 |
| 8 | 11 |
| 11 | 13 |
| 17 | 25 |
| 18 | 26 |
| 33 | 42 |
| 59 | 81 |
| 61 | 87 |
| 97 | 132 |
| 105 | 191 |
| 107 | 240 |
| 198 | 213 |
| 202 | 215 |
+-------+-----+
14 rows in set (0.00 sec)
SET @interval_id = 0;
SET @interval_end = 0;
SELECT
MIN(start) AS start,
MAX(end) AS end
FROM
(SELECT
@interval_id := IF(start > @interval_end,
@interval_id + 1,
@interval_id) AS interval_id,
@interval_end := IF(start < @interval_end,
GREATEST(@interval_end, end),
end) AS interval_end,
events.*
FROM events
ORDER BY start,end) tmp
GROUP BY interval_id;
+-------+------+
| start | end |
+-------+------+
| 1 | 13 |
| 17 | 26 |
| 33 | 42 |
| 59 | 87 |
| 97 | 240 |
+-------+------+
5 rows in set (0.00 sec)