R: Calculate a count of items over time using start and end dates


I want to calculate a count of items over time using their Start and End dates.

Some sample data

START <- as.Date(c("2014-01-01", "2014-01-02"…


                      
5 answers

暗喜, 2021-01-05 17:59

Using dplyr and grouped data:

library(dplyr)

# Sample intervals, duplicated into two groups "a" and "b"
df <- tibble(
  START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
  END   = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
)
df <- as_tibble(rbind(cbind(group = "a", df), cbind(group = "b", df)))
df

# Within each group, expand every interval into its days and tabulate them
df %>%
  group_by(group) %>%
  do(data.frame(table(Reduce(c, Map(seq, .$START, .$END, by = 1)))))


This is a common problem, for example when you want to count the number of logins on different pages or machines, given a time interval for each user.

> df
Source: local data frame [8 x 3]

  group      START        END
  (chr)     (date)     (date)
1     a 2014-01-01 2014-01-04
2     a 2014-01-02 2014-01-03
3     a 2014-01-03 2014-01-03
4     a 2014-01-03 2014-01-04
5     b 2014-01-01 2014-01-04
6     b 2014-01-02 2014-01-03
7     b 2014-01-03 2014-01-03
8     b 2014-01-03 2014-01-04
> 
> df %>% 
+   group_by(.,group) %>% 
+   do(data.frame(table(Reduce(c, Map(seq, .$START, .$END, by = 1)))))
Source: local data frame [8 x 3]
Groups: group [2]

  group       Var1  Freq
  (chr)     (fctr) (int)
1     a 2014-01-01     1
2     a 2014-01-02     2
3     a 2014-01-03     4
4     a 2014-01-04     2
5     b 2014-01-01     1
6     b 2014-01-02     2
7     b 2014-01-03     4
8     b 2014-01-04     2
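
For the login use case mentioned above, the same grouped count can also be sketched in base R. This is only an illustration under stated assumptions: the `logins` data frame, its `page` column and its values are hypothetical and not part of the original question.

```r
# Hypothetical login sessions: one row per session, with the page visited
logins <- data.frame(
  page  = c("home", "home", "docs"),
  START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-02")),
  END   = as.Date(c("2014-01-03", "2014-01-02", "2014-01-04"))
)

# Expand each session into one row per day it covers
expanded <- do.call(rbind, lapply(seq_len(nrow(logins)), function(i) {
  data.frame(page = logins$page[i],
             DATE = seq(logins$START[i], logins$END[i], by = 1))
}))

# Every expanded row stands for one active session on that day
expanded$COUNT <- 1L

# Sum the active sessions per page and day
counts <- aggregate(COUNT ~ page + DATE, data = expanded, FUN = sum)
counts
```

Each page and day combination that has at least one active session gets one row; combinations without any session are simply absent.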

                                                                        
                                                        
            
            
              
                
佛祖请我去吃肉, 2021-01-05 18:05

This would do it. You can change the column names as necessary.

as.data.frame(table(Reduce(c, Map(seq, df$START, df$END, by = 1))))
#         Var1 Freq
# 1 2014-01-01    1
# 2 2014-01-02    2
# 3 2014-01-03    4
# 4 2014-01-04    2


As noted in the comments, Var1 in the above solution is a factor, not a Date. To keep the Date class in the first column, you could do some more work on the above solution, or use plyr::count instead of as.data.frame(table(...)):

library(plyr)
count(Reduce(c, Map(seq, df$START, df$END, by = 1)))
#            x freq
# 1 2014-01-01    1
# 2 2014-01-02    2
# 3 2014-01-03    4
# 4 2014-01-04    2
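
The "some more work" option for the table() route might look like the sketch below, which uses only base R. The column names DATE and COUNT are my own choice for illustration, not something the answer prescribes.

```r
# Sample data from the question
df <- data.frame(
  START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
  END   = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
)

# Expand each interval into its individual days
days <- do.call(c, Map(seq, df$START, df$END, by = 1))

# Tabulate the days; keep the labels as character instead of factor
counts <- as.data.frame(table(DATE = days), responseName = "COUNT",
                        stringsAsFactors = FALSE)

# table() stores the dates as text labels, so convert them back to Date
counts$DATE <- as.Date(counts$DATE)
str(counts)
```

This avoids the plyr dependency at the cost of one explicit conversion step.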

                                                                        
                                                        
            
            
              
                
野的像风, 2021-01-05 18:09

You could use data.table:

library(data.table)
# Expand each row into one row per day, then count the rows per day
DT <- setDT(df)[, list(DATETIME = seq(START, END, by = 1)), by = 1:nrow(df)][
  , list(COUNT = .N), by = DATETIME]
DT
#      DATETIME COUNT
# 1: 2014-01-01     1
# 2: 2014-01-02     2
# 3: 2014-01-03     4
# 4: 2014-01-04     2




From data.table 1.9.4 onward, you can also use foverlaps() to do an "overlap join". It is more efficient because it does not have to expand the dates for each row before counting. Here's how:

require(data.table) ## 1.9.4
setDT(df) ## convert your data.frame to data.table by reference

## 1. Some preprocessing:
# create a lookup - the dates for which you need the count, and set key
dates = seq(as.Date("2014-01-01"), as.Date("2014-01-04"), by="days")
lookup = data.table(START=dates, END=dates, key=c("START", "END"))

## 2. Now find overlapping coordinates 
# for each row in `df` get all the rows it overlaps with in `lookup`
ans = foverlaps(df, lookup, type="any", which=TRUE)


Now, we just have to group by yid (= indices in lookup) and count:

## 3. count
ans[, .N, by=yid]
#    yid N
# 1:   1 1
# 2:   2 2
# 3:   3 4
# 4:   4 2


The first column corresponds to the row numbers in lookup. If a row number is missing from the result, the count for that date is 0.
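
To turn the yid indices back into dates and make the zero counts explicit, something like the following sketch could be used. It assumes data.table is installed; the sample intervals here are hypothetical and deliberately leave 2014-01-05 uncovered, and the names `n_per_date` and `counts` are mine.

```r
library(data.table)

# Hypothetical intervals that leave 2014-01-05 uncovered
df <- data.table(
  START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-06")),
  END   = as.Date(c("2014-01-04", "2014-01-03", "2014-01-06"))
)

# Lookup table: one zero-length interval per date of interest
dates  <- seq(min(df$START), max(df$END), by = "days")
lookup <- data.table(START = dates, END = dates, key = c("START", "END"))

# One row per (interval, date) overlap; yid indexes the rows of lookup
ans <- foverlaps(df, lookup, type = "any", which = TRUE)
n_per_date <- ans[, .N, by = yid]

# Start every date at 0, then fill in the dates that had overlaps
counts <- data.table(DATE = dates, COUNT = 0L)
counts[n_per_date$yid, COUNT := n_per_date$N]
counts
```

Dates that never appear in yid keep their initial count of 0, which is what the note above describes.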
                                                                        
                                                        
            
            
              
                
名媛妹妹, 2021-01-05 18:20

I proposed another lubridate-based solution, which is faster for larger data frames with wide date ranges, in a newer, related SO post.
                                                                        
                                                        
            
            
              
                
无人及你, 2021-01-05 18:22

Using dplyr and foreach:

library(dplyr)
library(foreach)

df <- data.frame(START = as.Date(c("2014-01-01",
                                   "2014-01-02",
                                   "2014-01-03",
                                   "2014-01-03")),
                 END = as.Date(c("2014-01-04",
                                 "2014-01-03",
                                 "2014-01-03",
                                 "2014-01-04")))
df

# For each day in the overall range, count the rows whose interval contains it
r <- foreach(DATETIME = seq(min(df$START), max(df$END), by = 1),
             .combine = rbind) %do% {
  df %>%
    filter(DATETIME >= START, DATETIME <= END) %>%
    summarise(DATETIME = DATETIME, COUNT = n())
}
r
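
The same per-day filter-and-count idea can be sketched without the dplyr and foreach dependencies. This is a base R variant of my own, not part of the answer above; it uses vapply in place of the foreach loop.

```r
# Sample data from the question
df <- data.frame(
  START = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03", "2014-01-03")),
  END   = as.Date(c("2014-01-04", "2014-01-03", "2014-01-03", "2014-01-04"))
)

# Every day between the earliest start and the latest end
days <- seq(min(df$START), max(df$END), by = 1)

# For each day, count the intervals that contain it
r <- data.frame(
  DATETIME = days,
  COUNT = vapply(days, function(d) sum(df$START <= d & d <= df$END),
                 integer(1))
)
r
```

Like the foreach version, this compares every day against every row, so the work grows with the number of days times the number of rows. Days that no interval covers still appear, with a count of 0.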

                                                                        
                                                        
            
            
              
                