I have a table like this:

    ID  BEGIN  END

If there are overlapping episodes for the same ID (like 2000-01-01 - 2001-1…
Pure SQL
For a pure SQL solution, look at Adam's post and read this article (it is written in French, but you will find it is not too hard to follow). The article was recommended to me after consulting the postgresql mailing list (thank you for that!).
For my data this was not suitable, because all of the possible solutions need to self-join the table at least three times (one variant is sketched below). That turns out to be a problem for (very) large amounts of data.
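For reference, the self-join approach looks roughly like this. This is a hedged sketch, not the article's exact query: it assumes a table mytable(id, "begin", "end") with date columns and, as in the rest of this answer, treats episodes separated by a single day as contiguous:

    SELECT s.id, s."begin", MIN(e."end") AS "end"
    FROM (
        -- candidate starts: a BEGIN not covered by any earlier episode
        SELECT t1.id, t1."begin"
        FROM mytable t1
        WHERE NOT EXISTS (
            SELECT 1
            FROM mytable t2
            WHERE t2.id = t1.id
              AND t2."begin" < t1."begin"
              AND t1."begin" <= t2."end" + 1
        )
    ) s
    JOIN (
        -- candidate ends: an END not covered by any later episode
        SELECT t1.id, t1."end"
        FROM mytable t1
        WHERE NOT EXISTS (
            SELECT 1
            FROM mytable t2
            WHERE t2.id = t1.id
              AND t2."begin" <= t1."end" + 1
              AND t1."end" < t2."end"
        )
    ) e ON e.id = s.id AND s."begin" <= e."end"
    GROUP BY s.id, s."begin"
    ORDER BY s.id, s."begin";

Each merged episode pairs a candidate start with the nearest candidate end, which is exactly why the table has to be scanned several times.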
Semi SQL, semi-imperative language
If you primarily care about speed and have the option of using an imperative language, you can get much faster results (depending on the amount of data, of course). In my case the task ran (at least) 1,000 times faster using R.
Steps:
(1) Export a .csv file. Take care of sorting: the R code below assumes the rows are ordered by ID and BEGIN!
COPY (
    SELECT "ID", "BEGIN", "END"
    FROM mytable            -- table name is a placeholder
    ORDER BY "ID", "BEGIN"  -- the sorting the R code relies on
) TO '/path/to.csv' WITH DELIMITER ';' CSV HEADER;  -- ';' matches read.csv2() below
(2) Do something like this (this code is R, but you could do something similar in any imperative language):
data <- read.csv2("</path/to.csv>")
data$BEGIN <- as.Date(data$BEGIN)
data$END <- as.Date(data$END)
smoothingEpisodes <- function(theData) {
    theLength <- nrow(theData)
    if (theLength < 2L) return(theData)

    # work on plain vectors; dates as numeric (days since 1970-01-01)
    ID    <- as.integer(theData[["ID"]])
    BEGIN <- as.numeric(theData[["BEGIN"]])
    END   <- as.numeric(theData[["END"]])

    # the episode currently being merged
    curId    <- ID[[1L]]
    curBEGIN <- BEGIN[[1L]]
    curEND   <- END[[1L]]

    # preallocated output columns
    out.1 <- integer(length = theLength)
    out.2 <- out.3 <- numeric(length = theLength)
    j <- 1L

    for (i in 2:nrow(theData)) {
        nextId    <- ID[[i]]
        nextBEGIN <- BEGIN[[i]]
        nextEND   <- END[[i]]

        if (curId != nextId | (curEND + 1) < nextBEGIN) {
            # new ID, or a gap of more than one day:
            # emit the current episode and start a new one
            out.1[[j]] <- curId
            out.2[[j]] <- curBEGIN
            out.3[[j]] <- curEND
            j <- j + 1L

            curId    <- nextId
            curBEGIN <- nextBEGIN
            curEND   <- nextEND
        } else {
            # overlapping or adjacent: extend the current episode
            curEND <- max(curEND, nextEND, na.rm = TRUE)
        }
    }

    # emit the last open episode
    out.1[[j]] <- curId
    out.2[[j]] <- curBEGIN
    out.3[[j]] <- curEND

    theOutput <- data.frame(
        ID    = out.1[1:j],
        BEGIN = as.Date(out.2[1:j], origin = "1970-01-01"),
        END   = as.Date(out.3[1:j], origin = "1970-01-01"))
    theOutput
}
data1 <- smoothingEpisodes(data)
data2 <- transform(data1, TAGE = (as.numeric(data1$END - data1$BEGIN) + 1))  # TAGE: episode length in days
write.csv2(data2, file = "</path/to/output.csv>")
You can find a detailed discussion of this R code here: "smoothing" time data - can it be done more efficient?
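If the results need to go back into PostgreSQL, a COPY in the other direction is the natural counterpart. A minimal sketch: the table name and path are placeholders, it assumes the CSV was written without R's row-name column (row.names = FALSE), and DELIMITER ';' matches what write.csv2() produces:

    CREATE TABLE smoothed_episodes ("ID" integer, "BEGIN" date, "END" date, "TAGE" integer);
    COPY smoothed_episodes FROM '/path/to/output.csv' WITH DELIMITER ';' CSV HEADER;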
Regarding your second concern: I'm not sure about PostgreSQL, but SQL Server has DATEDIFF(interval, start_date, end_date), which gives you the interval between two dates. You could use MIN(Begin) as the start date and MAX(End) as the end date to get the difference, and then use it in a CASE statement to output something, although you might need a subquery or equivalent for your scenario.
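In PostgreSQL the same idea works without DATEDIFF, because subtracting one date from another yields the number of days directly. A minimal sketch, assuming a table mytable(id, "begin", "end") with date columns; the one-year threshold is made up for illustration:

    SELECT id,
           MAX("end") - MIN("begin") AS span_days,  -- date - date yields integer days
           CASE
               WHEN MAX("end") - MIN("begin") >= 365 THEN 'spans a year or more'
               ELSE 'spans less than a year'
           END AS span_label
    FROM mytable
    GROUP BY id;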
Edit: It is great news that your DBA agreed to upgrade to a newer version of PostgreSQL. The window functions alone make the upgrade a worthwhile investment.
My original answer, as you note, has a major flaw: a limitation of one row per id. Below is a better solution without that limitation.
I have tested it using test tables on my system (PostgreSQL 8.4).
If/when you get a moment, I would like to know how it performs on your data.
I also wrote up an explanation here: https://www.mechanical-meat.com/1/detail
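If you want to try it before running it on real data, a throwaway test table along these lines is enough (a hedged sketch; the sample rows are invented and include one overlap and one one-day gap):

    CREATE TABLE mytable (id integer, "begin" date, "end" date);
    INSERT INTO mytable VALUES
        (1, DATE '2000-01-01', DATE '2000-03-31'),
        (1, DATE '2000-03-01', DATE '2000-06-30'),  -- overlaps the previous row
        (1, DATE '2000-07-01', DATE '2000-12-31'),  -- starts the day after: still merged
        (2, DATE '2000-01-01', DATE '2000-02-29');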
WITH RECURSIVE t1_rec (id, "begin", "end", n) AS (
    SELECT id, "begin", "end", n
    FROM (
        SELECT id, "begin", "end",
               CASE
                   WHEN LEAD("begin") OVER (
                            PARTITION BY id
                            ORDER BY "begin") <= ("end" + interval '2' day)
                   THEN 1 ELSE 0
               END AS cl,
               ROW_NUMBER() OVER (
                   PARTITION BY id
                   ORDER BY "begin") AS n
        FROM mytable
    ) s
    WHERE s.cl = 1
    UNION ALL
    SELECT p1.id, p1."begin", p1."end", a.n
    FROM t1_rec a
    JOIN mytable p1 ON p1.id = a.id
        AND p1."begin" > a."begin"
        AND (a."begin", a."end" + interval '2' day) OVERLAPS
            (p1."begin", p1."end")
)
SELECT t1.id, min(t1."begin"), max(t1."end")
FROM t1_rec t1
LEFT JOIN t1_rec t2 ON t1.id = t2.id
    AND t2."end" = t1."end"
    AND t2.n < t1.n
WHERE t2.n IS NULL
GROUP BY t1.id, t1.n
ORDER BY t1.id, t1.n;
Original (deprecated) answer follows; note its limitation of one row per id.
Denis is probably right about using lead() and lag(), but there is yet another way! You can also solve this problem using so-called recursive SQL. The OVERLAPS function also comes in handy. I have fully tested this solution on my system (8.4). It works well.
WITH RECURSIVE rec_stmt (id, "begin", "end") AS (
    /* seed statement:
       start with only the first start and end dates for each id
    */
    SELECT id, MIN("begin"), MIN("end")
    FROM mytable seed_stmt
    GROUP BY id
    UNION ALL
    /* iterative (not really recursive) statement:
       append qualifying rows to the result set
    */
    SELECT t1.id, t1."begin", t1."end"
    FROM rec_stmt r
    JOIN mytable t1 ON t1.id = r.id
        AND t1."begin" > r."end"
        AND (r."begin", r."end" + INTERVAL '1' DAY) OVERLAPS
            (t1."begin" - INTERVAL '1' DAY, t1."end")
)
SELECT MIN("begin"), MAX("end")
FROM rec_stmt
GROUP BY id;
I'm not making full sense of your question, but I'm absolutely certain that you need to look into the lead()/lag() window functions. Something like this, for instance, will be a good starting point to place in a subquery or a common table expression, in order to identify whether rows overlap or not per id:
select id,
       lag("start") over w as prev_start,
       lag("end") over w as prev_end,
       "start",
       "end",
       lead("start") over w as next_start,
       lead("end") over w as next_end
from yourtable
window w as (
    partition by id
    order by "start", "end"
);
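To turn that starting point into an actual overlap flag, one possible continuation is to wrap it in a common table expression and compare each row with its predecessor. A minimal sketch building on the query above; yourtable and the one-day adjacency rule are carried over from the rest of this thread:

    with w as (
        select id,
               lag("end") over (partition by id order by "start", "end") as prev_end,
               "start",
               "end"
        from yourtable
    )
    select id, "start", "end",
           -- true when this row overlaps or directly touches the previous episode
           (prev_end is not null and "start" <= prev_end + 1) as continues_previous
    from w;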