Faster ways to calculate frequencies and cast from long to wide

灰色年华 2020-11-21 04:56

I am trying to obtain counts of each combination of levels of two variables, "week" and "id". I'd like the result to have "id" as rows, and "week" as columns, and t

4 Answers
  •  独厮守ぢ
    2020-11-21 05:29

    The reason ddply is taking so long is that the splitting by group is not run in parallel (only the computations on the 'splits' are). With a large number of groups it will therefore be slow, and setting .parallel = TRUE will not help.

    An approach using data.table::dcast (data.table version >= 1.9.2) should be extremely efficient in time and memory. In this case, we can rely on default argument values and simply use:

    library(data.table) 
    dcast(setDT(data), id ~ week)
    # Using 'week' as value column. Use 'value.var' to override
    # Aggregate function missing, defaulting to 'length'
    #    id 1 2 3
    # 1:  1 2 1 1
    # 2:  2 0 0 1
    

    Or setting the arguments explicitly:

    dcast(setDT(data), id ~ week, value.var = "week", fun.aggregate = length)
    #    id 1 2 3
    # 1:  1 2 1 1
    # 2:  2 0 0 1
    

    For alternatives that work with data.table versions before 1.9.2, see the edit history of this answer.
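
    The calls above assume a data frame named data already exists. As a minimal, self-contained sketch, the sample data below is hypothetical: it was chosen only so that the counts agree with the output shown above, and it is not the asker's real data. The name wide and the cross-check against base R's table() are likewise illustrative additions, not part of the original answer.

    ```r
    library(data.table)

    # Hypothetical sample data (an assumption): one row per observation
    data <- data.frame(id   = c(1, 1, 1, 1, 2),
                       week = c(1, 1, 2, 3, 3))

    # Convert to a data.table by reference, then count rows for each
    # id/week combination and spread the weeks into columns
    wide <- dcast(setDT(data), id ~ week,
                  value.var = "week", fun.aggregate = length)
    print(wide)

    # Cross-check: base R's contingency table should hold the same counts
    counts <- table(data$id, data$week)
    print(counts)
    ```

    Combinations that never occur (here id 2 in weeks 1 and 2) are filled with 0, since length applied to an empty group is 0.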
