I have a table like this:

    ID  BEGIN  END

If there are overlapping episodes for the same ID (like 2000-01-01 - 2001-1…
Pure SQL
For a pure SQL solution, look at Adam's post and read this article (it is written in French, but you will find it is not too hard to follow). The article was recommended to me after consulting the postgresql mailing list (thank you for that!).
For my data this was not suitable, because all of the possible solutions need to self-join the table at least three times (one variant is sketched below). That turns out to be a problem for (very) large amounts of data.
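For reference, the self-join approach looks roughly like this. This is a hedged sketch, not the article's exact query: it assumes a table mytable(id, "begin", "end") with date columns and, as in the rest of this answer, treats episodes separated by a single day as contiguous:

    SELECT s.id, s."begin", MIN(e."end") AS "end"
    FROM (
        -- candidate starts: a BEGIN not covered by any earlier episode
        SELECT t1.id, t1."begin"
        FROM mytable t1
        WHERE NOT EXISTS (
            SELECT 1
            FROM mytable t2
            WHERE t2.id = t1.id
              AND t2."begin" < t1."begin"
              AND t1."begin" <= t2."end" + 1
        )
    ) s
    JOIN (
        -- candidate ends: an END not covered by any later episode
        SELECT t1.id, t1."end"
        FROM mytable t1
        WHERE NOT EXISTS (
            SELECT 1
            FROM mytable t2
            WHERE t2.id = t1.id
              AND t2."begin" <= t1."end" + 1
              AND t1."end" < t2."end"
        )
    ) e ON e.id = s.id AND s."begin" <= e."end"
    GROUP BY s.id, s."begin"
    ORDER BY s.id, s."begin";

Each merged episode pairs a candidate start with the nearest candidate end, which is exactly why the table has to be scanned several times.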
Semi SQL, semi-imperative language
If you primarily care about speed and have the option of using an imperative language, you can get much faster results (depending on the amount of data, of course). In my case the task ran (at least) 1,000 times faster using R.
Steps:
(1) Export a .csv file. Take care of sorting: the R code below assumes the rows are ordered by ID and BEGIN!
COPY (
    SELECT "ID", "BEGIN", "END"
    FROM mytable            -- table name is a placeholder
    ORDER BY "ID", "BEGIN"  -- the sorting the R code relies on
) TO '/path/to.csv' WITH DELIMITER ';' CSV HEADER;  -- ';' matches read.csv2() below
(2) Do something like this (this code is R, but you could do something similar in any imperative language):
data <- read.csv2("</path/to.csv>")
data$BEGIN <- as.Date(data$BEGIN)
data$END <- as.Date(data$END)
smoothingEpisodes <- function(theData) {
    theLength <- nrow(theData)
    if (theLength < 2L) return(theData)

    # work on plain vectors; dates as numeric (days since 1970-01-01)
    ID    <- as.integer(theData[["ID"]])
    BEGIN <- as.numeric(theData[["BEGIN"]])
    END   <- as.numeric(theData[["END"]])

    # the episode currently being merged
    curId    <- ID[[1L]]
    curBEGIN <- BEGIN[[1L]]
    curEND   <- END[[1L]]

    # preallocated output columns
    out.1 <- integer(length = theLength)
    out.2 <- out.3 <- numeric(length = theLength)
    j <- 1L

    for (i in 2:nrow(theData)) {
        nextId    <- ID[[i]]
        nextBEGIN <- BEGIN[[i]]
        nextEND   <- END[[i]]

        if (curId != nextId | (curEND + 1) < nextBEGIN) {
            # new ID, or a gap of more than one day:
            # emit the current episode and start a new one
            out.1[[j]] <- curId
            out.2[[j]] <- curBEGIN
            out.3[[j]] <- curEND
            j <- j + 1L

            curId    <- nextId
            curBEGIN <- nextBEGIN
            curEND   <- nextEND
        } else {
            # overlapping or adjacent: extend the current episode
            curEND <- max(curEND, nextEND, na.rm = TRUE)
        }
    }

    # emit the last open episode
    out.1[[j]] <- curId
    out.2[[j]] <- curBEGIN
    out.3[[j]] <- curEND

    theOutput <- data.frame(
        ID    = out.1[1:j],
        BEGIN = as.Date(out.2[1:j], origin = "1970-01-01"),
        END   = as.Date(out.3[1:j], origin = "1970-01-01"))
    theOutput
}
data1 <- smoothingEpisodes(data)
data2 <- transform(data1, TAGE = (as.numeric(data1$END - data1$BEGIN) + 1))  # TAGE: episode length in days
write.csv2(data2, file = "</path/to/output.csv>")
You can find a detailed discussion of this R code here: "smoothing" time data - can it be done more efficient?
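If the results need to go back into PostgreSQL, a COPY in the other direction is the natural counterpart. A minimal sketch: the table name and path are placeholders, it assumes the CSV was written without R's row-name column (row.names = FALSE), and DELIMITER ';' matches what write.csv2() produces:

    CREATE TABLE smoothed_episodes ("ID" integer, "BEGIN" date, "END" date, "TAGE" integer);
    COPY smoothed_episodes FROM '/path/to/output.csv' WITH DELIMITER ';' CSV HEADER;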
Regarding your second concern: I'm not sure about PostgreSQL, but SQL Server has DATEDIFF(interval, start_date, end_date), which gives you the interval between two dates. You could use MIN(Begin) as the start date and MAX(End) as the end date to get the difference, and then use it in a CASE statement to output something, although you might need a subquery or equivalent for your scenario.
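In PostgreSQL the same idea works without DATEDIFF, because subtracting one date from another yields the number of days directly. A minimal sketch, assuming a table mytable(id, "begin", "end") with date columns; the one-year threshold is made up for illustration:

    SELECT id,
           MAX("end") - MIN("begin") AS span_days,  -- date - date yields integer days
           CASE
               WHEN MAX("end") - MIN("begin") >= 365 THEN 'spans a year or more'
               ELSE 'spans less than a year'
           END AS span_label
    FROM mytable
    GROUP BY id;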
Edit: It is great news that your DBA agreed to upgrade to a newer version of PostgreSQL. The window functions alone make the upgrade a worthwhile investment.
My original answer, as you note, has a major flaw: a limitation of one row per id. Below is a better solution without that limitation.
I have tested it using test tables on my system (PostgreSQL 8.4).
If/when you get a moment, I would like to know how it performs on your data.
I also wrote up an explanation here: https://www.mechanical-meat.com/1/detail
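If you want to try it before running it on real data, a throwaway test table along these lines is enough (a hedged sketch; the sample rows are invented and include one overlap and one one-day gap):

    CREATE TABLE mytable (id integer, "begin" date, "end" date);
    INSERT INTO mytable VALUES
        (1, DATE '2000-01-01', DATE '2000-03-31'),
        (1, DATE '2000-03-01', DATE '2000-06-30'),  -- overlaps the previous row
        (1, DATE '2000-07-01', DATE '2000-12-31'),  -- starts the day after: still merged
        (2, DATE '2000-01-01', DATE '2000-02-29');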
WITH RECURSIVE t1_rec (id, "begin", "end", n) AS (
    SELECT id, "begin", "end", n
    FROM (
        SELECT id, "begin", "end",
               CASE
                   WHEN LEAD("begin") OVER (
                            PARTITION BY id
                            ORDER BY "begin") <= ("end" + interval '2' day)
                   THEN 1 ELSE 0
               END AS cl,
               ROW_NUMBER() OVER (
                   PARTITION BY id
                   ORDER BY "begin") AS n
        FROM mytable
    ) s
    WHERE s.cl = 1
    UNION ALL
    SELECT p1.id, p1."begin", p1."end", a.n
    FROM t1_rec a
    JOIN mytable p1 ON p1.id = a.id
        AND p1."begin" > a."begin"
        AND (a."begin", a."end" + interval '2' day) OVERLAPS
            (p1."begin", p1."end")
)
SELECT t1.id, min(t1."begin"), max(t1."end")
FROM t1_rec t1
LEFT JOIN t1_rec t2 ON t1.id = t2.id
    AND t2."end" = t1."end"
    AND t2.n < t1.n
WHERE t2.n IS NULL
GROUP BY t1.id, t1.n
ORDER BY t1.id, t1.n;
Original (deprecated) answer follows; note its limitation of one row per id.
Denis is probably right about using lead() and lag(), but there is yet another way! You can also solve this problem using so-called recursive SQL. The OVERLAPS function also comes in handy. I have fully tested this solution on my system (8.4). It works well.
WITH RECURSIVE rec_stmt (id, "begin", "end") AS (
    /* seed statement:
       start with only the first start and end dates for each id
    */
    SELECT id, MIN("begin"), MIN("end")
    FROM mytable seed_stmt
    GROUP BY id
    UNION ALL
    /* iterative (not really recursive) statement:
       append qualifying rows to the result set
    */
    SELECT t1.id, t1."begin", t1."end"
    FROM rec_stmt r
    JOIN mytable t1 ON t1.id = r.id
        AND t1."begin" > r."end"
        AND (r."begin", r."end" + INTERVAL '1' DAY) OVERLAPS
            (t1."begin" - INTERVAL '1' DAY, t1."end")
)
SELECT MIN("begin"), MAX("end")
FROM rec_stmt
GROUP BY id;
I'm not making full sense of your question, but I'm absolutely certain that you need to look into the lead()/lag() window functions. Something like this, for instance, will be a good starting point to place in a subquery or a common table expression, in order to identify whether rows overlap or not per id:
select id,
       lag("start") over w as prev_start,
       lag("end") over w as prev_end,
       "start",
       "end",
       lead("start") over w as next_start,
       lead("end") over w as next_end
from yourtable
window w as (
    partition by id
    order by "start", "end"
);
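To turn that starting point into an actual overlap flag, one possible continuation is to wrap it in a common table expression and compare each row with its predecessor. A minimal sketch building on the query above; yourtable and the one-day adjacency rule are carried over from the rest of this thread:

    with w as (
        select id,
               lag("end") over (partition by id order by "start", "end") as prev_end,
               "start",
               "end"
        from yourtable
    )
    select id, "start", "end",
           -- true when this row overlaps or directly touches the previous episode
           (prev_end is not null and "start" <= prev_end + 1) as continues_previous
    from w;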