When to use “Do” function in dplyr

谎友^ 2020-12-24 15:06

I've learned that the do function is used when you want to apply a function to each group.

For example, if I want to pull the top 2 rows from "A", "C", and

1 Answer
  • 2020-12-24 15:32

    The comments under the question point out that dplyr or its associated packages often provide an alternative that avoids do, and the examples in the question are of that sort. This answer, however, addresses the question directly rather than via alternatives:

    Differences between using do and not using it

    Within the context of data frames, the key differences between using do and not using do are:

    1. No automatic insertion of dot. The code within do does not have dot automatically inserted as its first argument. For example, instead of the do(summarise(Mean_2014 = mean(Y2014))) code in the question, one would have to write do(summarise(., Mean_2014 = mean(Y2014))) with an explicit dot. This is a consequence of do, rather than summarise, being the right-hand-side function of %>%. It is important to understand this so that we insert dot where needed. If, however, the only purpose were to avoid automatic insertion of dot into the first argument, we could alternatively use brace brackets: whatever %>% { myfun(arg1, arg2) } would also not insert dot as the first argument of the myfun call.

    2. Respecting group_by. There are two issues here. (1) Only functions specifically written to respect group_by are run once for each group; mutate, summarize and do are examples (there are others too). (2) Even if the function is run once for each group, there is the question of how dot is handled. We focus on two cases (not a complete list): (i) If do is not used and dot appears in a function call nested inside an argument expression, it refers to the entire input, ignoring group_by. Presumably this is a consequence of magrittr's dot substitution rules, which know nothing about group_by. (ii) Within do, on the other hand, dot always refers to the rows of the current group. For example, compare the output of the two pipelines below: dot refers to 3 rows in the first, where do is used, and to all 6 rows in the second, where it is not. This is despite the fact that summarize respects group_by in that it runs once per group.

      BOD$g <- c(1, 1, 1, 2, 2, 2)
      BOD %>% group_by(g) %>% do(summarize(., nr = nrow(.)))
      ## # A tibble: 2 x 2
      ## # Groups: g [2]
      ##       g    nr
      ##   <dbl> <int>
      ## 1  1.00     3
      ## 2  2.00     3
      
      BOD %>% group_by(g) %>% summarize(nr = nrow(.))
      ## # A tibble: 2 x 2
      ##       g    nr
      ##   <dbl> <int>
      ## 1  1.00     6
      ## 2  2.00     6
      

    See ?do for more information.
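
    The brace-bracket behaviour mentioned in point 1 can be checked with a small sketch. It uses only magrittr's %>% and base sum; the variable names are illustrative and do not come from the question.

```r
library(magrittr)

x <- 1:3

# Ordinary pipe: dot is inserted as the first argument, so this is sum(x, 10).
inserted <- x %>% sum(10)

# Braces suppress the automatic insertion, so this is just sum(10).
suppressed <- x %>% { sum(10) }

# Inside braces the dot is still available if it is written explicitly.
explicit <- x %>% { sum(., 10) }

c(inserted, suppressed, explicit)
## [1] 16 10 16
```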

    Code from Question

    Now we go through the code in the question. As mydata was never defined in the question we use the first line of code below to define it to facilitate concrete examples.

    mydata <- data.frame(Index = rep(c("A", "C", "I"), each = 3), Y2014 = 1)
    
    mydata %>% 
           filter(Index %in% c("A", "C", "I")) %>% 
           group_by(Index) %>% 
           do(head(., 2))
    
    ## # A tibble: 6 x 2
    ## # Groups: Index [3]
    ##   Index  Y2014
    ##   <fctr> <dbl>
    ## 1 A       1.00
    ## 2 A       1.00
    ## 3 C       1.00
    ## 4 C       1.00
    ## 5 I       1.00
    ## 6 I       1.00
    

    The code above produces 2 rows for each of the 3 groups, giving 6 rows. Had we omitted do, head would disregard group_by and return only two rows, because dot would refer to all 9 rows of input rather than to one group at a time. (In this particular case dplyr provides its own alternative to head that avoids these problems, but for the sake of illustrating the general point we stick to the code in the question.)
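
    Both remarks can be checked with a short sketch that reuses the mydata definition from above. slice is assumed here to be the kind of dplyr alternative the parenthetical has in mind (the answer does not name it); no_do and with_slice are illustrative names.

```r
library(dplyr)

mydata <- data.frame(Index = rep(c("A", "C", "I"), each = 3), Y2014 = 1)

# Without do, head ignores the grouping and returns the first 2 of all 9 rows.
no_do <- mydata %>% group_by(Index) %>% head(2)
nrow(no_do)
## [1] 2

# slice respects group_by, so 1:2 selects the first 2 rows of each group.
with_slice <- mydata %>% group_by(Index) %>% slice(1:2)
nrow(with_slice)
## [1] 6
```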

    The following code from the question generates an error because dot insertion is not done within do and so what ought to be the first argument of summarize, i.e. the data frame input, is missing:

    mydata %>% 
           group_by(Index) %>% 
           do(summarise(Mean_2014 = mean(Y2014)))
    ## Error in mean(Y2014) : object 'Y2014' not found
    

    If we remove the do in the above code, as in the last line of code in the question, then it works, since the dot insertion is performed. Alternatively, adding the dot, as in do(summarise(., Mean_2014 = mean(Y2014))), would also work, although do is superfluous in this case: summarize already respects group_by, so there is no need to wrap it in do.

    mydata %>% 
           group_by(Index) %>% 
           summarise(Mean_2014 = mean(Y2014))
    
    ## # A tibble: 3 x 2
    ##   Index  Mean_2014
    ##   <fctr>     <dbl>
    ## 1 A           1.00
    ## 2 C           1.00
    ## 3 I           1.00
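
    The variant with the explicit dot inside do can be compared against the version without do in the same way. This is a minimal sketch using the mydata definition from above; with_do and without_do are illustrative names.

```r
library(dplyr)

mydata <- data.frame(Index = rep(c("A", "C", "I"), each = 3), Y2014 = 1)

# Explicit dot: within do, dot is the rows of the current group.
with_do <- mydata %>%
  group_by(Index) %>%
  do(summarise(., Mean_2014 = mean(Y2014)))

# No do: summarise already runs once per group.
without_do <- mydata %>%
  group_by(Index) %>%
  summarise(Mean_2014 = mean(Y2014))

# Both give one row per group with the same means.
all.equal(with_do$Mean_2014, without_do$Mean_2014)
## [1] TRUE
```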
    