Emulate split() with dplyr group_by: return a list of data frames

Asked by 忘掉有多难 on 2020-11-29 06:07 · 6 answers · 569 views

I have a large dataset that chokes split() in R. I am able to use dplyr group_by (which is a preferred way anyway) but I am unable to persist the result as a list of data frames.

6 Answers
  • 2020-11-29 06:17

    Comparing the base, plyr and dplyr solutions, it still seems the base one is much faster!

    library(plyr)
    library(dplyr)
    library(microbenchmark)

    df <- data_frame(Group1 = rep(LETTERS, each = 1000),
                     Group2 = rep(rep(1:10, each = 100), 26),
                     Value = rnorm(26 * 1000))

    microbenchmark(Base = df %>%
                     split(list(.$Group2, .$Group1)),
                   dplyr = df %>%
                     group_by(Group1, Group2) %>%
                     do(data = (.)) %>%
                     select(data) %>%
                     lapply(function(x) {(x)}) %>% .[[1]],
                   plyr = dlply(df, c("Group1", "Group2"), as.tbl),
                   times = 50)

    

    Gives:

    Unit: milliseconds
      expr       min        lq      mean    median        uq       max neval
      Base  12.82725  13.38087  16.21106  14.58810  17.14028  41.67266    50
     dplyr  25.59038  26.66425  29.40503  27.37226  28.85828  77.16062    50
      plyr  99.52911 102.76313 110.18234 106.82786 112.69298 140.97568    50
    
  • 2020-11-29 06:22

    group_split in dplyr:

    dplyr has implemented group_split(): https://dplyr.tidyverse.org/reference/group_split.html

    It splits a data frame by groups and returns a list of data frames, each one a subset of the original defined by a level of the splitting variable.

    For example, split the iris dataset by the variable Species and compute a summary of each sub-dataset (map() here comes from purrr):

    > iris %>% 
    +     group_split(Species) %>% 
    +     map(summary)
    [[1]]
      Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
     Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100   setosa    :50  
     1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200   versicolor: 0  
     Median :5.000   Median :3.400   Median :1.500   Median :0.200   virginica : 0  
     Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246                  
     3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300                  
     Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600                  
    
    [[2]]
      Sepal.Length    Sepal.Width     Petal.Length   Petal.Width          Species  
     Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000   setosa    : 0  
     1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200   versicolor:50  
     Median :5.900   Median :2.800   Median :4.35   Median :1.300   virginica : 0  
     Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326                  
     3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500                  
     Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800                  
    
    [[3]]
      Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
     Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400   setosa    : 0  
     1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800   versicolor: 0  
     Median :6.500   Median :3.000   Median :5.550   Median :2.000   virginica :50  
     Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026                  
     3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300                  
     Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500     
    

    It is also very helpful for debugging calculations on nested data frames, because it gives a quick way to "see" what is going on inside them.
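    For instance, a quick way to eyeball each group's structure before running a heavier nested pipeline (a minimal sketch, assuming purrr is installed):

    ```r
    library(dplyr)
    library(purrr)

    # group_split() yields one tibble per species; walk() prints a
    # one-level str() of each so you can inspect columns and row counts
    iris %>%
      group_split(Species) %>%
      walk(~ str(.x, max.level = 1))
    ```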

  • 2020-11-29 06:30

    Since dplyr 0.5.0.9000, the shortest solution that uses group_by() is probably to follow do with a pull:

    df %>% group_by(V1) %>% do(data=(.)) %>% pull(data)
    

    Note that, unlike split, this doesn't name the resulting list elements. If this is desired, then you would probably want something like

    df %>% group_by(V1) %>% do(data = (.)) %>% with( set_names(data, V1) )
    

    To editorialize a little, I agree with the folks saying that split() is the better option. Personally, I always found it annoying that I have to type the name of the data frame twice (e.g., split( potentiallylongname, potentiallylongname$V1 )), but the issue is easily sidestepped with the pipe:

    df %>% split( .$V1 )
    
  • 2020-11-29 06:31

    You can get a list of data frames from group_by using do as long as you name the new column where the data frames will be stored and then pipe that column into lapply.

    listDf = df %>% group_by(V1) %>% do(vals=data.frame(.)) %>% select(vals) %>% lapply(function(x) {(x)})
    listDf[[1]]
    #[[1]]
    #  V1 V2 V3
    #1  a  1  2
    #2  a  2  3
    
    #[[2]]
    #  V1 V2 V3
    #1  b  3  4
    #2  b  4  2
    
    #[[3]]
    #  V1 V2 V3
    #1  c  5  2
    
  • 2020-11-29 06:35

    If you'd rather avoid base split(), plyr offers an alternative:

    library(plyr)
    
    dlply(df, "V1", identity)
    #$a
    #  V1 V2 V3
    #1  a  1  2
    #2  a  2  3
    
    #$b
    #  V1 V2 V3
    #1  b  3  4
    #2  b  4  2
    
    #$c
    #  V1 V2 V3
    #1  c  5  2
    
  • 2020-11-29 06:36

    Since dplyr 0.8 you can use group_split():

    library(dplyr)
    df = as.data.frame(cbind(c("a","a","b","b","c"),c(1,2,3,4,5), c(2,3,4,2,2)))
    df %>% group_by(V1) %>% group_split()
    #> [[1]]
    #> # A tibble: 2 x 3
    #>   V1    V2    V3   
    #>   <fct> <fct> <fct>
    #> 1 a     1     2    
    #> 2 a     2     3    
    #> 
    #> [[2]]
    #> # A tibble: 2 x 3
    #>   V1    V2    V3   
    #>   <fct> <fct> <fct>
    #> 1 b     3     4    
    #> 2 b     4     2    
    #> 
    #> [[3]]
    #> # A tibble: 1 x 3
    #>   V1    V2    V3   
    #>   <fct> <fct> <fct>
    #> 1 c     5     2
    
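    Note that group_split() drops the group names. If you want a named list like split() gives, one option (a sketch using group_keys(), which returns the grouping values in the same order as group_split()) is:

    ```r
    library(dplyr)

    df <- data.frame(V1 = c("a", "a", "b", "b", "c"),
                     V2 = c(1, 2, 3, 4, 5),
                     V3 = c(2, 3, 4, 2, 2))

    grouped <- df %>% group_by(V1)

    # group_keys() returns one row per group, in group_split() order,
    # so its V1 column can be used to name the list elements
    named_parts <- grouped %>%
      group_split() %>%
      setNames(group_keys(grouped)$V1)

    names(named_parts)   # "a" "b" "c"
    ```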