Stepping through a pipeline with intermediate results

梦谈多话 2021-01-01 19:50

Is there a way to output the result of a pipeline at each step without doing it manually (e.g. without selecting and running only the selected chunks)?

I ofte

5 Answers
  • 2021-01-01 20:39

    This is easy with a magrittr functional sequence. For example, define a function my_chain with:

    library(magrittr)

    foo <- function(x) x + 1
    bar <- function(x) x + 1
    baz <- function(x) x + 1
    my_chain <- . %>% foo %>% bar %>% baz
    

    and get the final result of the chain with:

    > my_chain(0)
    [1] 3
    

    You can get a function list with functions(my_chain) and define a "stepper" function like this:

    stepper <- function(fun_chain, x, FUN = print) {
      f_list <- functions(fun_chain)
      for(i in seq_along(f_list)) {
        x <- f_list[[i]](x)
        FUN(x)
      }
      invisible(x)
    }
    

    Then run the chain with print called after every step:

    stepper(my_chain, 0, print)
    
    # [1] 1
    # [1] 2
    # [1] 3
    

    Or pause for user input after each step:

    stepper(my_chain, 0, function(x) {print(x); readline()})
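
    If you would rather inspect the intermediate values afterwards than print them, the same loop can store them. This is only a sketch: collect_steps is a hypothetical helper name (not part of magrittr), and foo, bar and baz are redefined here with different bodies so that the steps are distinguishable.

    ```r
    library(magrittr)

    # Hypothetical helper (not part of magrittr): apply each function of a
    # functional sequence in turn and keep every intermediate value.
    collect_steps <- function(fun_chain, x) {
      f_list <- functions(fun_chain)            # magrittr::functions()
      results <- vector("list", length(f_list))
      for (i in seq_along(f_list)) {
        x <- f_list[[i]](x)
        results[[i]] <- x                       # assumes no step returns NULL
      }
      results
    }

    # Illustrative steps, chosen so each result differs
    foo <- function(x) x + 1
    bar <- function(x) x * 2
    baz <- function(x) x - 3
    my_chain <- . %>% foo %>% bar %>% baz

    steps <- collect_steps(my_chain, 0)
    str(steps)
    ```

    The last element of steps is the same value that my_chain(0) returns.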
    
  • 2021-01-01 20:39

    I wrote the package pipes, which provides several operators that might help:

    • use %P>% to print the output.
    • use %ae>% to use all.equal on input and output.
    • use %V>% to use View on the output, it will open a viewer for each relevant step.

    If you want to see some aggregated information, you can try %summary>%, %glimpse>% or %skim>%, which use summary, tibble::glimpse and skimr::skim respectively, or you can define your own pipe to show specific changes with new_pipe.

    # devtools::install_github("moodymudskipper/pipes")
    library(dplyr)
    library(pipes)
    
    res <- mtcars %P>% 
      group_by(cyl) %P>% 
      sample_frac(0.1) %P>% 
      summarise(res = mean(mpg))
    #> group_by(., cyl)
    #> # A tibble: 32 x 11
    #> # Groups:   cyl [3]
    #>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
    #>  * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    #>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
    #>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
    #>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
    #>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
    #>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
    #>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
    #>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
    #>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
    #>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
    #> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
    #> # ... with 22 more rows
    #> sample_frac(., 0.1)
    #> # A tibble: 3 x 11
    #> # Groups:   cyl [3]
    #>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
    #>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    #> 1  26       4  120.    91  4.43  2.14  16.7     0     1     5     2
    #> 2  17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4
    #> 3  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
    #> summarise(., res = mean(mpg))
    #> # A tibble: 3 x 2
    #>     cyl   res
    #>   <dbl> <dbl>
    #> 1     4  26  
    #> 2     6  17.8
    #> 3     8  18.7
    
    res <- mtcars %ae>% 
      group_by(cyl) %ae>% 
      sample_frac(0.1) %ae>% 
      summarise(res = mean(mpg))
    #> group_by(., cyl)
    #> [1] "Attributes: < Names: 1 string mismatch >"                                              
    #> [2] "Attributes: < Length mismatch: comparison on first 2 components >"                     
    #> [3] "Attributes: < Component \"class\": Lengths (1, 4) differ (string compare on first 1) >"
    #> [4] "Attributes: < Component \"class\": 1 string mismatch >"                                
    #> [5] "Attributes: < Component 2: Modes: character, list >"                                   
    #> [6] "Attributes: < Component 2: Lengths: 32, 2 >"                                           
    #> [7] "Attributes: < Component 2: names for current but not for target >"                     
    #> [8] "Attributes: < Component 2: Attributes: < target is NULL, current is list > >"          
    #> [9] "Attributes: < Component 2: target is character, current is tbl_df >"
    #> sample_frac(., 0.1)
    #> [1] "Different number of rows"
    #> summarise(., res = mean(mpg))
    #> [1] "Cols in y but not x: `res`. "                                                                
    #> [2] "Cols in x but not y: `qsec`, `wt`, `drat`, `hp`, `disp`, `mpg`, `carb`, `gear`, `am`, `vs`. "
    
    res <- mtcars %V>% 
      group_by(cyl) %V>% 
      sample_frac(0.1) %V>% 
      summarise(res = mean(mpg))
    # you'll have to test this one by yourself
    
  • 2021-01-01 20:41

    IMHO magrittr is mostly useful interactively, that is, when I am exploring data or building a new formula/model.

    In these cases, storing intermediate results in distinct variables is time-consuming and distracting, while pipes let me focus on the data rather than on typing:

    x %>% foo
    ## reason on results and 
    x %>% foo %>% bar
    ## reason on results and 
    x %>% foo %>% bar %>% baz
    ## etc.
    

    The problem here is that I don't know in advance what the final pipe will be, so I cannot define the chain up front as in @bergant's answer.

    Typing, as in @zx8754's answer,

    x %>% print %>% foo %>% print %>% bar %>% print %>% baz
    

    adds too much overhead and, to me, defeats the whole purpose of magrittr.

    Essentially magrittr lacks a simple operator that both prints and pipes results.
    The good news is that it seems quite easy to craft one:

    # print the left-hand side, then pipe it into the right-hand side (needs magrittr loaded)
    `%P>%` <- function(lhs, rhs) { print(lhs); lhs %>% rhs }
    

    Now you can print and pipe:

    1:4 %P>% sqrt %P>% sum 
    ## [1] 1 2 3 4
    ## [1] 1.000000 1.414214 1.732051 2.000000
    ## [1] 6.146264
    

    I found that if one defines key bindings for %P>% and %>%, the prototyping workflow becomes very streamlined (see Emacs ESS or RStudio).
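
    The one-liner above prints the input of each step and receives its right-hand side as an ordinary argument, so it is meant for bare function names such as sqrt. A possible variant that prints the output of each step and also accepts calls with extra arguments is sketched below. The operator name %print>% is hypothetical, and the approach (rebuilding the pipe call with bquote and evaluating it in the caller's environment) is just one way to do it, not something magrittr provides.

    ```r
    library(magrittr)

    # Hypothetical operator (not part of magrittr): pipe, then print the result.
    `%print>%` <- function(lhs, rhs) {
      # Rebuild `lhs %>% rhs` with the right-hand side left unevaluated, so
      # that %>% itself inserts the piped value as the first argument.
      piped <- bquote(.(lhs) %>% .(substitute(rhs)))
      out <- eval(piped, parent.frame())
      print(out)
      invisible(out)
    }

    # Each step prints its own result; the final value is also returned.
    total <- c(1, 4, NA, 9) %print>% sqrt() %print>% sum(na.rm = TRUE)
    ```

    Because the value of lhs is inlined into the rebuilt call, this is best kept for interactive exploration.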

  • 2021-01-01 20:42

    You can select which results to print by using the tee-operator (%T>%) and print(). The tee-operator is used exclusively for side-effects like printing.

    # e.g.
    library(dplyr)
    library(magrittr)  # %T>% is not re-exported by dplyr

    mtcars %>%
      group_by(cyl) %T>% print() %>%
      sample_frac(0.1) %T>% print() %>%
      summarise(res = mean(mpg))
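
    To see why this works, note that %T>% runs its right-hand side but passes its left-hand side on unchanged. A minimal sketch with a plain numeric vector (the values are illustrative only):

    ```r
    library(magrittr)

    # print() is called on the result of sqrt(), but sum() receives that same
    # result, not the return value of print().
    res <- c(1, 4, 9) %>% sqrt() %T>% print() %>% sum()
    #> [1] 1 2 3

    res
    #> [1] 6
    ```

    The same holds for any side-effect function, for example plot() or str().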
    
  • 2021-01-01 20:46

    Add print between the steps:

    mtcars %>% 
      group_by(cyl) %>% 
      print %>% 
      sample_frac(0.1) %>% 
      print %>% 
      summarise(res = mean(mpg))
    