Proper idiom for adding zero count rows in tidyr/dplyr

旧巷少年郎 2020-11-27 16:02

Suppose I have some count data that looks like this:

library(tidyr)
library(dplyr)

X.raw <- data.frame(
    x = as.factor(c("A", "A", "A", "B",

5 Answers
  • 2020-11-27 16:38

    The complete function from tidyr is made for just this situation.

    From the docs:

    This is a wrapper around expand(), left_join() and replace_na() that's useful for completing missing combinations of data.

    You can use it in two ways. First, you can apply it to the original dataset before summarizing, "completing" the dataset with all combinations of x and y and filling z with 0. Alternatively, keep the default NA fill and pass na.rm = TRUE to sum.

    X.raw %>% 
        complete(x, y, fill = list(z = 0)) %>% 
        group_by(x,y) %>% 
        summarise(count = sum(z))
    
    Source: local data frame [4 x 3]
    Groups: x [?]
    
           x      y count
      <fctr> <fctr> <dbl>
    1      A      i     1
    2      A     ii     5
    3      B      i    15
    4      B     ii     0
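
The NA-fill alternative mentioned above can be sketched as follows. The definition of X.raw is truncated in the question, so the data frame below is a hypothetical stand-in chosen to match the counts shown in the answers, and X.na is just an illustrative name.

```r
library(tidyr)
library(dplyr)

# Hypothetical stand-in for X.raw (assumption: the original definition is cut off).
X.raw <- data.frame(
    x = factor(c("A", "A", "A", "B", "B", "B")),
    y = factor(c("i", "ii", "ii", "i", "i", "i")),
    z = 1:6
)

X.na <- X.raw %>%
    complete(x, y) %>%                  # the missing B/ii combination gets z = NA
    group_by(x, y) %>%
    summarise(count = sum(z, na.rm = TRUE), .groups = "drop")  # NA is dropped, so the sum is 0

X.na
```

Summing with na.rm = TRUE turns the all-NA group into a zero, so no explicit fill value is needed.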
    

    You can also use complete on your pre-summarized dataset. Note that complete respects grouping. X.tidy is grouped, so you can either ungroup and complete the dataset by x and y or just list the variable you want completed within each group - in this case, y.

    # Complete after ungrouping
    X.tidy %>% 
        ungroup() %>%
        complete(x, y, fill = list(count = 0))
    
    # Complete within grouping
    X.tidy %>% 
        complete(y, fill = list(count = 0))
    

    The result is the same for each option:

    Source: local data frame [4 x 3]
    
           x      y count
      <fctr> <fctr> <dbl>
    1      A      i     1
    2      A     ii     5
    3      B      i    15
    4      B     ii     0
    
  • 2020-11-27 16:41

    plyr has the functionality you're looking for, but dplyr before version 0.8 does not, so there you need some extra code to include the zero-count groups, as shown by @momeara. Also see this question. In plyr::ddply you just add .drop = FALSE to keep zero-count groups in the final result. For example:

    library(plyr)
    
    X.tidy = ddply(X.raw, .(x,y), summarise, count=sum(z), .drop=FALSE)
    
    X.tidy
      x  y count
    1 A  i     1
    2 A ii     5
    3 B  i    15
    4 B ii     0
    
  • 2020-11-27 16:44

    Since dplyr 0.8 you can do it by setting the parameter .drop = FALSE in group_by:

    X.tidy <- X.raw %>% group_by(x, y, .drop = FALSE) %>% summarise(count=sum(z))
    X.tidy
    # # A tibble: 4 x 3
    # # Groups:   x [2]
    #   x     y     count
    #   <fct> <fct> <int>
    # 1 A     i         1
    # 2 A     ii        5
    # 3 B     i        15
    # 4 B     ii        0
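
Note that .drop = FALSE only keeps empty groups that come from unused factor levels, so character columns would need converting to factors first. As a rough sketch of the same idea in one step, count() also accepts .drop; the data frame below is a hypothetical stand-in for the truncated X.raw, and X.count is an illustrative name.

```r
library(dplyr)

# Hypothetical stand-in for X.raw (assumption: the original definition is cut off).
X.raw <- data.frame(
    x = factor(c("A", "A", "A", "B", "B", "B")),
    y = factor(c("i", "ii", "ii", "i", "i", "i")),
    z = 1:6
)

# wt = z sums z within each group; .drop = FALSE keeps the empty B/ii group.
X.count <- X.raw %>%
    count(x, y, wt = z, name = "count", .drop = FALSE)

X.count
```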
    
  • 2020-11-27 16:49

    You can use tidyr's expand to make all combinations of levels of factors, and then left_join:

    # Ungroup first: recent tidyr versions do not allow expanding on a grouping column
    X.tidy %>% ungroup() %>% expand(x, y) %>% left_join(X.tidy)
    
    # Joining by: c("x", "y")
    # Source: local data frame [4 x 3]
    # 
    #   x  y count
    # 1 A  i     1
    # 2 A ii     5
    # 3 B  i    15
    # 4 B ii    NA
    

    You can then keep the missing values as NA or replace them with 0 or any other value. This isn't a complete solution to the problem either, but it is faster and uses less memory than spread and gather.
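
One way to do the NA replacement is tidyr's replace_na. This is a minimal sketch, assuming a hypothetical stand-in for the truncated X.raw and an ungrouped X.tidy; X.full is an illustrative name.

```r
library(tidyr)
library(dplyr)

# Hypothetical stand-in for X.raw (assumption: the original definition is cut off).
X.raw <- data.frame(
    x = factor(c("A", "A", "A", "B", "B", "B")),
    y = factor(c("i", "ii", "ii", "i", "i", "i")),
    z = 1:6
)

# Summary without the zero-count row; .groups = "drop" returns an ungrouped tibble.
X.tidy <- X.raw %>%
    group_by(x, y) %>%
    summarise(count = sum(z), .groups = "drop")

X.full <- X.tidy %>%
    expand(x, y) %>%                         # all combinations of the factor levels
    left_join(X.tidy, by = c("x", "y")) %>%  # B/ii has no match, so count is NA
    replace_na(list(count = 0L))             # turn that NA into an integer zero

X.full
```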

  • 2020-11-27 16:54

    You could explicitly build all possible combinations and then join them with the tidy summary:

    # Build every x/y combination, then attach the existing counts
    X.fill <- expand.grid(x = unique(X.tidy$x), y = unique(X.tidy$y)) %>%
        left_join(X.tidy, by = c("x", "y")) %>%
        mutate(count = ifelse(is.na(count), 0, count)) # replace missing counts with 0
    