Split a large dataframe into a list of data frames based on common value in column

离开以前 2020-11-22 07:52

I have a data frame with 10 columns that collects actions of "users", where one of the columns (column 10) contains an ID (not unique, identifying the user). The length of the da

3 Answers
  • 2020-11-22 08:46

    Stumbled across this answer and I actually wanted BOTH groups (data containing that one user and data containing everything but that one user). Not necessary for the specifics of this post, but I thought I would add in case someone was googling the same issue as me.

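    # Example data: two random columns and a grouping factor g with levels A-E
    df <- data.frame(
      ran_data1 = rnorm(125),
      ran_data2 = rnorm(125),
      g = rep(factor(LETTERS[1:5]), 25)
    )

    # Rows where g is "A"
    test_x <- split(df, df$g)[["A"]]
    # Rows where g is anything other than "A"
    test_y <- split(df, df$g != "A")[["TRUE"]]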
    

    Here's what it looks like (the values differ from run to run because no seed is set):

    > head(test_x)
            ran_data1  ran_data2 g
    1   1.1362198  1.2969541 A
    6   0.5510307 -0.2512449 A
    11  0.0321679  0.2358821 A
    16  0.4734277 -1.2889081 A
    21 -1.2686151  0.2524744 A
    
    > head(test_y)
            ran_data1  ran_data2 g
    2 -2.23477293  1.1514810 B
    3 -0.46958938 -1.7434205 C
    4  0.07365603  0.1111419 D
    5 -1.08758355  0.4727281 E
    7  0.28448637 -1.5124336 B
    8  1.24117504  0.4928257 C
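
    If both pieces are needed, a single call to split() on the logical condition returns them together. A minimal sketch, assuming the same sample df as above; the seed and the names parts, only_a and not_a are made up for illustration:

    ```r
    set.seed(42)  # assumption: seed added only so the example is repeatable
    df <- data.frame(
      ran_data1 = rnorm(125),
      ran_data2 = rnorm(125),
      g = rep(factor(LETTERS[1:5]), 25)
    )

    # One split on the logical vector yields a list with elements "FALSE" and "TRUE"
    parts  <- split(df, df$g == "A")
    only_a <- parts[["TRUE"]]   # rows where g is "A"
    not_a  <- parts[["FALSE"]]  # every other row

    nrow(only_a)  # 25
    nrow(not_a)   # 100
    ```

    This avoids splitting the data frame twice when both halves are wanted.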
    
  • 2020-11-22 08:48

    You can just as easily access each element of the list with, e.g., path[[1]]. You can't put a set of matrices into an atomic vector and access each one, because a matrix is itself an atomic vector with dimension attributes. I would use the list structure returned by split(); that is what it was designed for. Each list element can hold data of a different type and size, so it is very versatile, and you can use the *apply functions to operate further on each element. Example below.

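    #  For reproducible data. The output shown below may differ on R >= 3.6.0,
    #  which changed the default algorithm used by sample().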
    set.seed(1)
    
    #  Make some data
    userid <- rep(1:2,times=4)
    data1 <- replicate(8 , paste( sample(letters , 3 ) , collapse = "" ) )
    data2 <- sample(10,8)
    df <- data.frame( userid , data1 , data2 )
    
    #  Split on userid
    out <- split( df , f = df$userid )
    #$`1`
    #  userid data1 data2
    #1      1   gjn     3
    #3      1   yqp     1
    #5      1   rjs     6
    #7      1   jtw     5
    
    #$`2`
    #  userid data1 data2
    #2      2   xfv     4
    #4      2   bfe    10
    #6      2   mrx     2
    #8      2   fqd     9
    

    Access each element using the [[ operator like this:

    out[[1]]
    #  userid data1 data2
    #1      1   gjn     3
    #3      1   yqp     1
    #5      1   rjs     6
    #7      1   jtw     5
    

    Or use an *apply function to do further operations on each list element. For instance, to take the mean of the data2 column you could use sapply like this:

    sapply( out , function(x) mean( x$data2 ) )
    #   1    2 
    #3.75 6.25 
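
    When the per-group operation returns a data frame rather than a single number, lapply() keeps the list structure and do.call(rbind, ...) stacks the pieces back together. A sketch only: the data2 values are copied from the output above, and the rank column and the names ranked and combined are made up for illustration:

    ```r
    df <- data.frame(
      userid = rep(1:2, times = 4),
      data2  = c(3, 4, 1, 10, 6, 2, 5, 9)  # values taken from the printed output
    )
    out <- split(df, f = df$userid)

    # Add a within-user rank of data2 to each piece (hypothetical operation)
    ranked <- lapply(out, function(x) {
      x$rank <- rank(x$data2)
      x
    })

    # Stack the pieces back into one data frame
    combined <- do.call(rbind, ranked)
    nrow(combined)  # 8
    ```

    sapply() suits results that simplify to a vector; lapply() is the safer choice when each result is itself a data frame.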
    
  • 2020-11-22 08:58

    From version 0.8.0, dplyr offers a handy function called group_split():

    library(dplyr)

    # On sample data from @Aus_10
    df %>%
      group_split(g)
    
    [[1]]
    # A tibble: 25 x 3
       ran_data1 ran_data2 g    
           <dbl>     <dbl> <fct>
     1     2.04      0.627 A    
     2     0.530    -0.703 A    
     3    -0.475     0.541 A    
     4     1.20     -0.565 A    
     5    -0.380    -0.126 A    
     6     1.25     -1.69  A    
     7    -0.153    -1.02  A    
     8     1.52     -0.520 A    
     9     0.905    -0.976 A    
    10     0.517    -0.535 A    
    # … with 15 more rows
    
    [[2]]
    # A tibble: 25 x 3
       ran_data1 ran_data2 g    
           <dbl>     <dbl> <fct>
     1     1.61      0.858 B    
     2     1.05     -1.25  B    
     3    -0.440    -0.506 B    
     4    -1.17      1.81  B    
     5     1.47     -1.60  B    
     6    -0.682    -0.726 B    
     7    -2.21      0.282 B    
     8    -0.499     0.591 B    
     9     0.711    -1.21  B    
    10     0.705     0.960 B    
    # … with 15 more rows
    

    To not include the grouping column (the argument is .keep as of dplyr 1.0.0; earlier versions called it keep):

    df %>%
      group_split(g, .keep = FALSE)
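
    For comparison, base R can drop the grouping column without dplyr by splitting only the remaining columns. A sketch, assuming the sample df from the first answer; pieces is a hypothetical name:

    ```r
    df <- data.frame(
      ran_data1 = rnorm(125),
      ran_data2 = rnorm(125),
      g = rep(factor(LETTERS[1:5]), 25)
    )

    # Split every column except g, while still grouping by g
    pieces <- split(df[setdiff(names(df), "g")], df$g)

    names(pieces)         # "A" "B" "C" "D" "E"
    names(pieces[["A"]])  # "ran_data1" "ran_data2"
    ```

    Unlike group_split(), split() names the list elements after the group values, so a group can be looked up by name.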
    