To parallelize a task, I need to split a big data.table into roughly equal parts, keeping together the groups defined by a column, id. Suppose:

N <- 16  # in application N is very large
k <- 6   # in application k << N
dt <- data.table(id = sample(letters[1:k], N, replace = TRUE), value = runif(N))
Here is an alternative approach using dplyr. Run the chained pipeline step by step to visualise how the dataset changes at each stage; the process is simple.
library(data.table)
library(dplyr)

set.seed(1)
N <- 16  # in application N is very large
k <- 6   # in application k << N
dt <- data.table(id = sample(letters[1:k], N, replace = TRUE), value = runif(N)) %>%
  arrange(id)

dt %>%
  select(id) %>%                    # keep only the id column
  distinct() %>%                    # one row per distinct id value
  mutate(group = ntile(id, 3)) %>%  # assign each id to one of 3 roughly equal bins
  inner_join(dt, by = "id")         # join the group labels back to the full data
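To actually run the task in parallel, the grouped result can then be split on the new group column and dispatched to workers. A minimal sketch, assuming a Unix-alike platform for parallel::mclapply (on Windows, parallel::parLapply with a cluster would be the substitute) and a hypothetical process_chunk() standing in for the real work:

library(parallel)

result <- dt %>%
  select(id) %>%
  distinct() %>%
  mutate(group = ntile(id, 3)) %>%
  inner_join(dt, by = "id")

process_chunk <- function(chunk) sum(chunk$value)  # hypothetical stand-in for the real task
chunks <- split(result, result$group)              # list with one data frame per group label
mclapply(chunks, process_chunk, mc.cores = 3)      # run the three chunks in parallel

One design note: ntile(id, 3) bins the distinct ids into groups of roughly equal id count, so the chunks are balanced by number of ids rather than number of rows. When k << N and the ids occur with similar frequency, that is usually close enough to equal-sized parts.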
PS: I've learnt lots of useful stuff from the previous answers.