Compute rolling sum by id variables, with missing timepoints

后端未结

关注

 4  1928

I\'m trying to learn R and there are a few things I\'ve done for 10+ years in SAS that I cannot quite figure out the best way to do in R. Take this data:

id


                      
              相关标签:


      
      
        
          4条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  甜味超标        
                
              
                            
                2020-12-14 03:57
              
            
            
                                                                       
A farily efficient answer to this problem could be found using the data.table library.

##Utilize the data.table package
library("data.table")
data <- data.table(t,class,id,count,desired)[order(id,class)]

##Assign each customer an ID
data[,Cust_No:=.GRP,by=c("id","class")]

##Create "list" of comparison dates and values
Ref <- data[,list(Compare_Value=list(I(count)),Compare_Date=list(I(t))), by=c("id","class")]

##Compare two lists and see of the compare date is within N days
data$Roll.Val <- mapply(FUN = function(RD, NUM) {
  d <- as.numeric(Ref$Compare_Date[[NUM]] - RD)
  sum((d <= 0 & d >= -124)*Ref$Compare_Value[[NUM]])
}, RD = data$t,NUM=data$Cust_No)

##Print out data
data <- data[,list(id,class,t,count,desired,Roll.Val)][order(id,class)]
data

id class          t count desired Roll.Val
1:  1     A 2010-01-15     1       1        1
2:  1     A 2010-02-15     2       3        3
3:  1     B 2010-04-15     3       3        3
4:  1     B 2010-09-15     4       4        4
5:  2     A 2010-01-15     5       5        5
6:  2     B 2010-06-15     6       6        6
7:  2     B 2010-08-15     7      13       13
8:  2     B 2010-09-15     8      21       21

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  野性不改        
                
              
                            
                2020-12-14 04:00
              
            
            
                                                                       
I'm almost embarrassed to post this.  I'm usually pretty good as these, but there's got to be a better way.  

This first uses zoo's as.yearmon to get the dates in terms of just month and year, then reshapes it to get one column for each id/class combination, then fills in with zeros before, after, and for missing months, then uses zoo to get the rolling sum, then pulls out just the desired months and merges back with the original data frame.

library(reshape2)
library(zoo)
df$yearmon <- as.yearmon(df$t)
dfa <- dcast(id + class ~ yearmon, data=df, value.var="count")
ida <- dfa[,1:2]
dfa <- t(as.matrix(dfa[,-c(1:2)]))
months <- with(df, seq(min(yearmon)-3/12, max(yearmon)+3/12, by=1/12))
dfb <- array(dim=c(length(months), ncol(dfa)), 
             dimnames=list(paste(months), colnames(dfa)))
dfb[rownames(dfa),] <- dfa
dfb[is.na(dfb)] <- 0
dfb <- rollsumr(dfb,4, fill=0)
rownames(dfb) <- paste(months)
dfb <- dfb[rownames(dfa),]
dfc <- cbind(ida, t(dfb))
dfc <- melt(dfc, id.vars=c("class", "id"))
names(dfc)[3:4] <- c("yearmon", "desired2")
dfc$yearmon <- as.yearmon(dfc$yearmon)
out <- merge(df,dfc)

> out
  id class  yearmon          t count desired desired2
1  1     A Feb 2010 2010-02-15     2       3        3
2  1     A Jan 2010 2010-01-15     1       1        1
3  1     B Apr 2010 2010-04-15     3       3        3
4  1     B Sep 2010 2010-09-15     4       4        4
5  2     A Jan 2010 2010-01-15     5       5        5
6  2     B Aug 2010 2010-08-15     7      13       13
7  2     B Jun 2010 2010-06-15     6       6        6
8  2     B Sep 2010 2010-09-15     8      21       21

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  栀梦        
                
              
                            
                2020-12-14 04:05
              
            
            
                                                                       
Here are a few solutions:

1) zoo Using ave, for each group create a monthly series, m, by merging the original series, z, with a grid, g.  Then calculate the rolling sum and retain only the original time points:

library(zoo)
f <- function(i) { 
    z <- with(df[i, ], zoo(count, t))
    g <- zoo(, seq(start(z), end(z), by = "month"))
    m <- merge(z, g)
    window(rollapplyr(m, 4, sum, na.rm = TRUE, partial = TRUE), time(z))
}
df$desired <- ave(1:nrow(df), df$id, df$class, FUN = f)


which gives:

> df
  id class          t count desired
1  1     A 2010-01-15     1       1
2  1     A 2010-02-15     2       3
3  1     B 2010-04-15     3       3
4  1     B 2010-09-15     4       4
5  2     A 2010-01-15     5       5
6  2     B 2010-06-15     6       6
7  2     B 2010-08-15     7      13
8  2     B 2010-09-15     8      21


Note We have assumed the times are ordered within each group (as in the question).  If that is not so then sort df first.

2) sqldf

library(sqldf)
sqldf("select id, class, a.t, a.'count', sum(b.'count') desired 
   from df a join df b 
   using(id, class) 
   where a.t - b.t between 0 and 100
   group by id, class, a.t")


which gives:

  id class          t count desired
1  1     A 2010-01-15     1       1
2  1     A 2010-02-15     2       3
3  1     B 2010-04-15     3       3
4  1     B 2010-09-15     4       4
5  2     A 2010-01-15     5       5
6  2     B 2010-06-15     6       6
7  2     B 2010-08-15     7      13
8  2     B 2010-09-15     8      21


Note:  If the merge should be too large to fit into memory then use sqldf("...", dbname = tempfile()) to cause the intermediate results to be stored in a database which it creates on the fly and automatically destroys afterwards.

3) Base R The sqldf solution motivates this base R solution which just translates the SQL into R:

m <- merge(df, df, by = 1:2)
s <- subset(m, t.x - t.y >= 0 & t.x - t.y <= 100)
ag <- aggregate(count.y ~ t.x + class + id, s, sum)
names(ag) <- c("t", "class", "id", "count", "desired")


The result is:

> ag
           t class id count desired
1 2010-01-15     A  1     1       1
2 2010-02-15     A  1     2       3
3 2010-04-15     B  1     3       3
4 2010-09-15     B  1     4       4
5 2010-01-15     A  2     5       5
6 2010-06-15     B  2     6       6
7 2010-08-15     B  2     7      13
8 2010-09-15     B  2     8      21


Note: This does do a merge in memory which might be a problem if the data set is very large.

UPDATE: Minor simplifications of first solution and also added second solution.

UPDATE 2: Added third solution.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  半阙折子戏        
                
              
                            
                2020-12-14 04:21
              
            
            
                                                                       
With runner package one can calculate everything on rolling windows. Below example of using sum_run

library(runner)
df %>%
  group_by(id) %>%
  mutate(
    output = sum_run(count, k = 30*4, idx = t)   
  )

# <dbl> <fct> <date>     <dbl>   <dbl>  <dbl>
#     1 A     2010-01-15     1       1      1
#     1 A     2010-02-15     2       3      3
#     1 B     2010-04-15     3       3      6
#     1 B     2010-09-15     4       4      4
#     2 A     2010-01-15     5       5      5
#     2 B     2010-06-15     6       6      6
#     2 B     2010-08-15     7      13     13
#     2 B     2010-09-15     8      21     21

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复