merge DATE-rows if episodes are in direct succession or overlapping

我寻月下人不归 2021-01-14 07:47

I have a table like this:

ID    BEGIN    END

If there are overlapping episodes for the same ID (like 2000-01-01 - 2001-1

4 Answers
  • 2021-01-14 08:15

    Pure SQL

    For a pure SQL solution, see Adam's post and read this article (it is written in French, but it is not too hard to follow). The article was recommended to me on the PostgreSQL mailing list (thank you for that!).

    For my data this was not suitable, because all of these solutions need to self-join the table at least three times. That turns out to be a problem for (very) large amounts of data.

    Half SQL, half imperative language

    If you primarily care about speed and you are able to use an imperative language, you can get much faster results (depending on the amount of data, of course). In my case the task ran at least 1,000 times faster using R.

    Steps:

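    (1) Export the table to a .csv file. Make sure the rows are sorted (by ID, then by BEGIN), because the code in step (2) processes them in a single pass and relies on that order.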

    COPY (
      SELECT "ID", "BEGIN", "END"
      <sorry, for a reason I don't know StackOverflow won't let me finish my code here...>
    

    (2) Do something like this (this code is R, but you could do something similar in any imperative language):

    data <- read.csv2("</path/to.csv>")
    data$BEGIN <- as.Date(data$BEGIN)
    data$END <- as.Date(data$END)
    
    smoothingEpisodes <- function (theData) {
    
        theLength <- nrow(theData)
        if (theLength < 2L) return(theData)
    
        ID <- as.integer(theData[["ID"]])
        BEGIN <- as.numeric(theData[["BEGIN"]])
        END <- as.numeric(theData[["END"]])
    
        # the episode that is currently being built
        curId <- ID[[1L]]
        curBEGIN <- BEGIN[[1L]]
        curEND <- END[[1L]]
    
        # preallocate the output vectors; only the first j entries are used
        out.1 <- integer(length = theLength)
        out.2 <- out.3 <- numeric(length = theLength)
    
        j <- 1L
    
        for (i in 2:theLength) {
            nextId <- ID[[i]]
            nextBEGIN <- BEGIN[[i]]
            nextEND <- END[[i]]
    
            # new ID or a gap of at least one day: close the current episode
            if (curId != nextId | (curEND + 1) < nextBEGIN) {
                out.1[[j]] <- curId
                out.2[[j]] <- curBEGIN
                out.3[[j]] <- curEND
    
                j <- j + 1L
    
                curId <- nextId
                curBEGIN <- nextBEGIN
                curEND <- nextEND
            } else {
                # overlapping or directly adjacent: extend the current episode
                curEND <- max(curEND, nextEND, na.rm = TRUE)
            }
        }
    
        # write out the last open episode
        out.1[[j]] <- curId
        out.2[[j]] <- curBEGIN
        out.3[[j]] <- curEND
    
        theOutput <- data.frame(ID = out.1[1:j], BEGIN = as.Date(out.2[1:j], origin = "1970-01-01"), END = as.Date(out.3[1:j], origin = "1970-01-01"))
    
        theOutput
    }
    
    data1 <- smoothingEpisodes(data)
    
    # TAGE: length of each merged episode in days, counting both end points
    data2 <- transform(data1, TAGE = (as.numeric(data1$END - data1$BEGIN) + 1))
    
    write.csv2(data2, file = "</path/to/output.csv>")
    

    You can find a detailed discussion on this R-Code here: "smoothing" time data - can it be done more efficient?
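
For readers who do not use R, the same single-pass idea can be sketched in Python. This is only an illustration under a few assumptions: the function name `merge_episodes` and the sample rows are made up for the example, rows are `(id, begin, end)` tuples holding `datetime.date` values, and "direct succession" means the next episode starts no later than one day after the current one ends, as in the R code above.

```python
from datetime import date, timedelta


def merge_episodes(rows):
    """Merge episodes per ID that overlap or follow each other directly.

    `rows` is an iterable of (id, begin, end) tuples with datetime.date values.
    """
    merged = []
    # Sorting by ID, then BEGIN, is what the single pass relies on.
    for cur_id, begin, end in sorted(rows):
        if merged:
            last_id, last_begin, last_end = merged[-1]
            # Same ID and no gap of a full day: extend the open episode.
            if last_id == cur_id and begin <= last_end + timedelta(days=1):
                merged[-1] = (last_id, last_begin, max(last_end, end))
                continue
        # Different ID or a real gap: start a new episode.
        merged.append((cur_id, begin, end))
    return merged


# Hypothetical sample data, not taken from the question.
rows = [
    (1, date(2000, 1, 1), date(2001, 12, 31)),
    (1, date(2001, 6, 1), date(2002, 6, 30)),   # overlaps the first row
    (1, date(2002, 7, 1), date(2002, 12, 31)),  # starts the day after the previous end
    (1, date(2004, 1, 1), date(2004, 1, 31)),   # real gap, stays separate
    (2, date(2000, 1, 1), date(2000, 1, 31)),
]
result = merge_episodes(rows)
for row in result:
    print(row)
```

The sketch sorts in memory for convenience. With very large exports it is probably better to sort in the database, as in step (1), and stream the rows instead.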

  • 2021-01-14 08:16

    Regarding your second concern: I'm not sure about PostgreSQL, but SQL Server has DATEDIFF(interval, start_date, end_date), which returns the difference between two dates in the unit you specify. You could use MIN(Begin) as the start date and MAX(End) as the end date to get the length of the merged episode, and then use that value in a CASE expression to output something. You may need a subquery or something equivalent for your scenario.
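
As a small illustration of the arithmetic only (not of any particular database function), the span between the earliest begin and the latest end can be sketched in Python. The helper `span_in_days` and the sample episodes are hypothetical. The inclusive count, which adds one day, mirrors the `TAGE` column in the R answer above.

```python
from datetime import date


def span_in_days(begin, end, inclusive=True):
    """Days between two dates; the inclusive count includes both end points."""
    return (end - begin).days + (1 if inclusive else 0)


# Hypothetical episodes for one ID as (begin, end) pairs.
episodes = [
    (date(2000, 1, 1), date(2000, 1, 31)),
    (date(2000, 2, 1), date(2000, 3, 15)),
]
earliest = min(begin for begin, _ in episodes)
latest = max(end for _, end in episodes)
total_days = span_in_days(earliest, latest)
print(total_days)
```

Whether to count both end points depends on how END is defined in the table. If END is the last day that still belongs to the episode, the inclusive count is probably the one you want.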

  • 2021-01-14 08:26

    Edit: That is great news that your DBA agreed to upgrade to a newer version of PostgreSQL. The windowing functions alone make the upgrade a worthwhile investment.

    My original answer—as you note—has a major flaw: a limitation of one row per id.
    Below is a better solution without such a limitation.
    I have tested it using test tables on my system (8.4).

    If / when you get a moment I would like to know how it performs on your data.
    I also wrote up an explanation here: https://www.mechanical-meat.com/1/detail

    WITH RECURSIVE t1_rec ( id, "begin", "end", n ) AS (
        SELECT id, "begin", "end", n
          FROM (
            SELECT
                id, "begin", "end",
                CASE 
                    WHEN LEAD("begin") OVER (
                    PARTITION BY    id
                    ORDER BY        "begin") <= ("end" + interval '2' day) 
                    THEN 1 ELSE 0 END AS cl,
                ROW_NUMBER() OVER (
                    PARTITION BY    id
                    ORDER BY        "begin") AS n
            FROM mytable 
        ) s
        WHERE s.cl = 1
      UNION ALL
        SELECT p1.id, p1."begin", p1."end", a.n
          FROM t1_rec a 
               JOIN mytable p1 ON p1.id = a.id
           AND p1."begin" > a."begin"
           AND (a."begin",  a."end" + interval '2' day) OVERLAPS 
               (p1."begin", p1."end")
    )
    SELECT t1.id, min(t1."begin"), max(t1."end")
      FROM t1_rec t1
           LEFT JOIN t1_rec t2 ON t1.id = t2.id 
           AND t2."end" = t1."end"
           AND t2.n < t1.n
     WHERE t2.n IS NULL
     GROUP BY t1.id, t1.n
     ORDER BY t1.id, t1.n;
    

    Original (deprecated) answer follows;
    note: limitation of one row per id.


    Denis is probably right about using lead() and lag(), but there is yet another way!
    You can also solve this problem using so-called recursive SQL.
    The overlaps function also comes in handy.

    I have fully tested this solution on my system (8.4).
    It works well.

    WITH RECURSIVE rec_stmt ( id, "begin", "end" ) AS (
        /* seed statement: 
               start with only first start and end dates for each id 
           (the column names are quoted because END is a reserved word)
        */
          SELECT id, MIN("begin"), MIN("end")
            FROM mytable seed_stmt
        GROUP BY id
    
        UNION ALL
    
        /* iterative (not really recursive) statement: 
               append qualifying rows to resultset 
        */
          SELECT t1.id, t1."begin", t1."end"
            FROM rec_stmt r
                 JOIN mytable t1 ON t1.id = r.id
             AND t1."begin" > r."end"
             AND (r."begin", r."end" + INTERVAL '1' DAY) OVERLAPS 
                 (t1."begin" - INTERVAL '1' DAY, t1."end")
    )
      SELECT MIN("begin"), MAX("end") 
        FROM rec_stmt
    GROUP BY id;
    
  • 2021-01-14 08:33

    I'm not making full sense of your question, but I'm absolutely certain that you need to look into the lead()/lag() window functions.

    Something like this, for instance, will be a good starting point to place in a subquery or a common table expression, in order to identify whether rows overlap or not per id:

    select id,
           lag(start) over w as prev_start,
           lag("end") over w as prev_end,
           start,
           "end",
           lead(start) over w as next_start,
           lead("end") over w as next_end
    from yourtable
    window w as (
           partition by id
           order by start, "end"
           )
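
To make it concrete what such a query hands you per row, here is a rough Python emulation of `lag()` and `lead()` over a window partitioned by id. It is a sketch under assumptions: the helper names `with_neighbours` and `overlaps_previous` and the sample rows are invented for the illustration, and rows are `(id, start, end)` tuples.

```python
from datetime import date
from itertools import groupby
from operator import itemgetter


def with_neighbours(rows):
    """Return (id, prev_end, start, end, next_start) per row, partitioned by id."""
    out = []
    # sorted() orders by id, start, end; groupby then forms one partition per id.
    for _, group in groupby(sorted(rows), key=itemgetter(0)):
        part = list(group)
        for i, (row_id, start, end) in enumerate(part):
            prev_end = part[i - 1][2] if i > 0 else None                # lag("end")
            next_start = part[i + 1][1] if i + 1 < len(part) else None  # lead(start)
            out.append((row_id, prev_end, start, end, next_start))
    return out


def overlaps_previous(row):
    """True if the row starts on or before the end of the preceding row of the same id."""
    _, prev_end, start, _, _ = row
    return prev_end is not None and start <= prev_end


# Hypothetical sample data.
rows = [
    (1, date(2000, 1, 1), date(2000, 6, 30)),
    (1, date(2000, 6, 1), date(2000, 12, 31)),
    (1, date(2002, 1, 1), date(2002, 1, 31)),
    (2, date(2000, 1, 1), date(2000, 1, 31)),
]
flags = [overlaps_previous(r) for r in with_neighbours(rows)]
print(flags)
```

Note that comparing against the immediately preceding row only catches overlaps with that one row. A long earlier episode that covers several later ones would need a running maximum of the end date instead.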
    