Faster ways to calculate frequencies and cast from long to wide

2020-11-21 04:56

I am trying to obtain counts of each combination of levels of two variables, "week" and "id". I'd like the result to have "id" as rows, "week" as columns, and the counts of each combination as the cell values.

4 Answers
  • 2020-11-21 05:29

    The reason ddply is taking so long is that the splitting by group is not run in parallel (only the computations on the splits are), so with a large number of groups it will be slow, and .parallel = TRUE will not help.
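
    For reference (not necessarily the exact code from the question), a typical plyr formulation of this kind of count, assuming the data frame is called data, looks something like:

    library(plyr)
    # Count rows per (id, week) group; the group-splitting step is
    # single-threaded, which is what makes this slow with many groups.
    counts_long <- ddply(data, .(id, week), nrow)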

    An approach using data.table::dcast (data.table version >= 1.9.2) should be extremely efficient in time and memory. In this case, we can rely on default argument values and simply use:

    library(data.table) 
    dcast(setDT(data), id ~ week)
    # Using 'week' as value column. Use 'value.var' to override
    # Aggregate function missing, defaulting to 'length'
    #    id 1 2 3
    # 1:  1 2 1 1
    # 2:  2 0 0 1
    

    Or setting the arguments explicitly:

    dcast(setDT(data), id ~ week, value.var = "week", fun.aggregate = length)
    #    id 1 2 3
    # 1:  1 2 1 1
    # 2:  2 0 0 1
    

    For pre-data.table 1.9.2 alternatives, see edits.
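
    If you want to check the timing claim on data closer to your real size, a rough benchmark sketch along these lines can be used (the sizes, seed, and the reshape2 comparison below are just illustrative, and assume reshape2 is also installed):

    library(data.table)
    set.seed(1)
    # Simulate a long table with many id groups.
    big <- data.frame(id   = sample(1e5, 1e6, replace = TRUE),
                      week = sample(1:52, 1e6, replace = TRUE))
    # data.table::dcast on a copy, so 'big' itself is left untouched.
    system.time(dcast(setDT(copy(big)), id ~ week,
                      value.var = "week", fun.aggregate = length))
    # The reshape2 equivalent, for comparison.
    system.time(reshape2::dcast(big, id ~ week,
                                value.var = "week", fun.aggregate = length))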

  • 2020-11-21 05:36

    You don't need ddply for this. The dcast from reshape2 is sufficient:

    dat <- data.frame(
        id = c(rep(1, 4), 2),
        week = c(1:3, 1, 3)
    )
    
    library(reshape2)
    dcast(dat, id~week, fun.aggregate=length)
    
      id 1 2 3
    1  1 2 1 1
    2  2 0 0 1
    

    Edit: For a base R solution (other than table, as posted by Joshua Ulrich), try xtabs:

    xtabs(~id+week, data=dat)
    
       week
    id  1 2 3
      1 2 1 1
      2 0 0 1
    
  • 2020-11-21 05:36

    You could just use the table command:

    table(data$id, data$week)
    
        1 2 3
      1 2 1 1
      2 0 0 1
    

    If "id" and "week" are the only columns in your data frame, you can simply use:

    table(data)
    #    week
    # id  1 2 3
    #   1 2 1 1
    #   2 0 0 1
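
    If you need the result as a regular data frame rather than a table object, the usual base R conversion is as.data.frame.matrix, which keeps the wide layout (a plain as.data.frame would melt it back to long format with a Freq column):

    as.data.frame.matrix(table(data$id, data$week))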
    
  • 2020-11-21 05:46

    A tidyverse option could be:

    library(dplyr)
    library(tidyr)
    
    df %>%
      count(id, week) %>%
      pivot_wider(names_from = week, values_from = n, values_fill = list(n = 0))
      # spread(week, n, fill = 0)  # in older versions of tidyr
    
    #     id   `1`   `2`   `3`
    #   <dbl> <dbl> <dbl> <dbl>
    #1     1     2     1     1
    #2     2     0     0     1
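
    The backtick-quoted numeric column names can be awkward to refer to later; if you prefer syntactic names, pivot_wider()'s names_prefix argument prepends a label (the "week_" prefix below is just an example):

    df %>%
      count(id, week) %>%
      pivot_wider(names_from = week, values_from = n,
                  values_fill = list(n = 0), names_prefix = "week_")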
    

    Or using tabyl from janitor:

    janitor::tabyl(df, id, week)
    # id 1 2 3
    #  1 2 1 1
    #  2 0 0 1
    

    data

    df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L), week = c(1L, 2L, 3L, 
    1L, 3L)), class = "data.frame", row.names = c(NA, -5L))
    