I have a data frame on which I calculate a run length encoding for a specific column, dir, whose values are either -1, 0, or 1:

dir.rle <- rle(df$dir)

How can I use this to compute a cumulative sum of another column within each of those runs?
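For reference, a minimal sketch of what rle() returns on a toy dir vector (the values are made up purely for illustration):

#rle() reports the length and value of each consecutive run
rle(c(1, 1, -1, 0, 0, 1))
#Run Length Encoding
#  lengths: int [1:4] 2 1 2 1
#  values : num [1:4] 1 -1 0 1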
This can be broken down into a two-step problem. First, if we create an indexing column based on the rle, then we can use that column to group by and run the cumsum. The grouping can then be performed by any number of aggregation techniques. I'll show two options, one using data.table and the other using plyr.
library(data.table)
library(plyr)
#data.table is the same thing as a data.frame for most purposes
#Fake data
dat <- data.table(dir = sample(-1:1, 20, TRUE), value = rnorm(20))
dir.rle <- rle(dat$dir)
#Compute an indexing column to group by
dat <- transform(dat, indexer = rep(1:length(dir.rle$lengths), dir.rle$lengths))
#What does the indexer column look like?
> head(dat)
     dir      value indexer
[1,]   1  0.5045807       1
[2,]   0  0.2660617       2
[3,]   1  1.0369641       3
[4,]   1 -0.4514342       3
[5,]  -1 -0.3968631       4
[6,]  -1 -2.1517093       4
#data.table approach
dat[, cumsum(value), by = indexer]
#plyr approach
ddply(dat, "indexer", summarize, V1 = cumsum(value))
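If you'd rather keep the cumulative sum alongside the original columns than return it as a separate result, a minimal data.table variant (reusing the dat and indexer from above; run.csum is just an illustrative column name) is to assign by reference with the := operator:

#Add the per-run cumulative sum as a new column of dat
dat[, run.csum := cumsum(value), by = indexer]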
Add a 'group' column to the data frame. Something like:
df <- data.frame(z = rnorm(100))  # dummy data
df$dir <- sign(df$z)              # dummy +/- 1
rl <- rle(df$dir)
df$group <- rep(1:length(rl$lengths), times = rl$lengths)
then use tapply to sum within groups:
tapply(df$z, df$group, sum)
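Note that tapply() with sum gives a single total per run; if, as in the question, you want a running cumulative sum within each run, base R's ave() applies a function group-wise and returns a vector as long as the original (a sketch reusing the dummy df above; z.csum is just an illustrative column name):

# Cumulative sum within each run, aligned row-for-row with df
df$z.csum <- ave(df$z, df$group, FUN = cumsum)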
Both Spacedman & Chase make the key point that a grouping variable simplifies everything (and Chase lays out two nice ways to proceed from there).
I'll just throw in an alternative approach to forming that grouping variable. It doesn't use rle and, at least to me, feels more intuitive. Basically, at each point where diff() detects a change in value, the cumsum that will form your grouping variable is incremented by one:
df$group <- c(0, cumsum(!(diff(df$dir)==0)))
# Or, equivalently
df$group <- c(0, cumsum(as.logical(diff(df$dir))))
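As a quick sanity check of the diff()-based grouping (the dir vector below is made up purely for illustration), each change in dir starts a new group, numbered here from 0:

dir <- c(1, 1, -1, 0, 0, 1)
c(0, cumsum(as.logical(diff(dir))))
# [1] 0 0 1 2 2 3

The resulting vector can then serve as the grouping column for any of the aggregation approaches shown above.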