Proper idiom for adding zero count rows in tidyr/dplyr


Suppose I have some count data that looks like this:

library(tidyr)
library(dplyr)

X.raw <- data.frame(
    x = as.factor(c("A", "A", "A", "B", ...


                      
5 Answers

小鲜肉 (2020-11-27 16:38)

The complete function from tidyr is made for just this situation.

From the docs:


  This is a wrapper around expand(), left_join() and replace_na that's
  useful for completing missing combinations of data.


You can use it in two ways. First, you can apply it to the original dataset before summarizing, "completing" the dataset with all combinations of x and y and filling z with 0 (alternatively, keep the default NA fill and pass na.rm = TRUE to sum).

X.raw %>% 
    complete(x, y, fill = list(z = 0)) %>% 
    group_by(x,y) %>% 
    summarise(count = sum(z))

Source: local data frame [4 x 3]
Groups: x [?]

       x      y count
  <fctr> <fctr> <dbl>
1      A      i     1
2      A     ii     5
3      B      i    15
4      B     ii     0
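
The NA-fill alternative mentioned above could look roughly like the sketch below. Because the original X.raw is not fully shown, `counts_demo` is a hypothetical stand-in invented for illustration, with `x` and `y` as factors and `z` as the value being summed.

```r
library(tidyr)
library(dplyr)

# Hypothetical example data (not the original X.raw):
# there are no rows for the combination x = "B", y = "ii"
counts_demo <- data.frame(
    x = factor(c("A", "A", "A", "B", "B")),
    y = factor(c("i", "ii", "ii", "i", "i")),
    z = c(1, 2, 3, 7, 8)
)

# complete() adds the missing (B, ii) row with z = NA (the default fill);
# na.rm = TRUE makes sum() ignore that NA, so the empty group sums to 0
result_na <- counts_demo %>%
    complete(x, y) %>%
    group_by(x, y) %>%
    summarise(count = sum(z, na.rm = TRUE), .groups = "drop")

print(result_na)
```

One trade-off to keep in mind: na.rm = TRUE also hides any NA values that were already present in z, whereas fill = list(z = 0) only touches the rows that complete() adds.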


You can also use complete on your pre-summarized dataset.  Note that complete respects grouping.  X.tidy is grouped, so you can either ungroup and complete the dataset by x and y or just list the variable you want completed within each group - in this case, y.

# Complete after ungrouping
X.tidy %>% 
    ungroup() %>%
    complete(x, y, fill = list(count = 0))

# Complete within grouping
X.tidy %>% 
    complete(y, fill = list(count = 0))


The result is the same for each option:

Source: local data frame [4 x 3]

       x      y count
  <fctr> <fctr> <dbl>
1      A      i     1
2      A     ii     5
3      B      i    15
4      B     ii     0

长发绾君心 (2020-11-27 16:41)

plyr has the functionality you're looking for. dplyr did not when this answer was written (group_by gained a .drop argument in dplyr 0.8), so with older dplyr versions you need some extra code to include the zero-count groups, as shown by @momeara. Also see this question. In plyr::ddply you just add .drop = FALSE to keep zero-count groups in the final result. For example:

library(plyr)

X.tidy = ddply(X.raw, .(x,y), summarise, count=sum(z), .drop=FALSE)

X.tidy
  x  y count
1 A  i     1
2 A ii     5
3 B  i    15
4 B ii     0

野的像风 (2020-11-27 16:44)

Since dplyr 0.8 you can do this by setting .drop = FALSE in group_by, which keeps the empty groups for unused combinations of factor levels:

X.tidy <- X.raw %>% group_by(x, y, .drop = FALSE) %>% summarise(count=sum(z))
X.tidy
# # A tibble: 4 x 3
# # Groups:   x [2]
#   x     y     count
#   <fct> <fct> <int>
# 1 A     i         1
# 2 A     ii        5
# 3 B     i        15
# 4 B     ii        0
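
As a related sketch, assuming dplyr 0.8 or later: count() accepts the same .drop = FALSE argument, and its wt argument sums a column instead of counting rows. Empty groups are only kept for factor columns, so character columns would need converting with factor() first. `counts_demo` is a hypothetical stand-in for the question's data, not the original X.raw.

```r
library(dplyr)

# Hypothetical example data (not the original X.raw):
# there are no rows for the combination x = "B", y = "ii"
counts_demo <- data.frame(
    x = factor(c("A", "A", "A", "B", "B")),
    y = factor(c("i", "ii", "ii", "i", "i")),
    z = c(1, 2, 3, 7, 8)
)

# wt = z sums z within each x/y group; .drop = FALSE keeps the empty
# (B, ii) combination, whose sum is 0. The result column is named n.
result_count <- counts_demo %>%
    count(x, y, wt = z, .drop = FALSE)

print(result_count)
```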

                                                                        
                                                        
            
            
              
                
日久生厌 (2020-11-27 16:49)

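You can use tidyr's expand to make all combinations of the factor levels, and then left_join. X.tidy is still grouped by x, so ungroup it first; recent tidyr versions do not allow expanding a grouping column:

X.tidy %>% ungroup() %>% expand(x, y) %>% left_join(X.tidy)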

# Joining by: c("x", "y")
# Source: local data frame [4 x 3]
# 
#   x  y count
# 1 A  i     1
# 2 A ii     5
# 3 B  i    15
# 4 B ii    NA


You can then keep the missing values as NA or replace them with 0 or any other value.
This isn't a complete solution to the problem either, but it is faster and more memory-friendly than spread and gather.
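
The replacement step could be sketched with tidyr's replace_na(), as below. The names `counts_demo` and `summary_demo` are hypothetical stand-ins for X.raw and X.tidy, invented here because the original data is not fully shown.

```r
library(tidyr)
library(dplyr)

# Hypothetical example data (not the original X.raw):
# there are no rows for the combination x = "B", y = "ii"
counts_demo <- data.frame(
    x = factor(c("A", "A", "A", "B", "B")),
    y = factor(c("i", "ii", "ii", "i", "i")),
    z = c(1, 2, 3, 7, 8)
)

# Ungrouped summary that lacks the (B, ii) row
summary_demo <- counts_demo %>%
    group_by(x, y) %>%
    summarise(count = sum(z), .groups = "drop")

# expand() builds every x/y combination from the factor levels,
# left_join() brings the counts back (NA where no group existed),
# and replace_na() turns those NAs into 0
result_filled <- summary_demo %>%
    expand(x, y) %>%
    left_join(summary_demo, by = c("x", "y")) %>%
    replace_na(list(count = 0))

print(result_filled)
```

Passing by = c("x", "y") explicitly also avoids the "Joining by" message that left_join() prints when it has to guess the join columns.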
                                                                        
                                                        
            
            
              
                
南笙 (2020-11-27 16:54)

You could explicitly build all possible combinations and then join them with the tidy summary:

X.fill <- expand.grid(x = unique(X.tidy$x), y = unique(X.tidy$y)) %>%
    left_join(X.tidy, by = c("x", "y")) %>%
    mutate(count = ifelse(is.na(count), 0, count)) # replace missing (NA) counts with 0

                                                                        
                                                        
            
            
              
                