To parallelize a task, I need to split a big data.table into roughly equal parts, keeping together the groups defined by a column, id. Suppose:

N <- 16  # in application N is very large
k <- 6   # in application k << N
dt <- data.table(id = sample(letters[1:k], N, replace = TRUE), value = runif(N))
Here is an alternative approach using dplyr. Run the chained pipeline step by step to visualise how the dataset changes at each stage; the process is simple.
library(data.table)
library(dplyr)

set.seed(1)
N <- 16  # in application N is very large
k <- 6   # in application k << N
dt <- data.table(id = sample(letters[1:k], N, replace = TRUE), value = runif(N)) %>%
  arrange(id)

dt %>%
  select(id) %>%                    # keep only the id column
  distinct() %>%                    # one row per distinct id value
  mutate(group = ntile(id, 3)) %>%  # assign each id to one of 3 roughly equal bins
  inner_join(dt, by = "id")         # join the group labels back to the full data
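To actually run the task in parallel, the grouped result can then be split on the new group column and dispatched to workers. A minimal sketch, assuming a Unix-alike platform for parallel::mclapply (on Windows, parallel::parLapply with a cluster would be the substitute) and a hypothetical process_chunk() standing in for the real work:

library(parallel)

result <- dt %>%
  select(id) %>%
  distinct() %>%
  mutate(group = ntile(id, 3)) %>%
  inner_join(dt, by = "id")

process_chunk <- function(chunk) sum(chunk$value)  # hypothetical stand-in for the real task
chunks <- split(result, result$group)              # list with one data frame per group label
mclapply(chunks, process_chunk, mc.cores = 3)      # run the three chunks in parallel

One design note: ntile(id, 3) bins the distinct ids into groups of roughly equal id count, so the chunks are balanced by number of ids rather than number of rows. When k << N and the ids occur with similar frequency, that is usually close enough to equal-sized parts.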
PS: I've learnt lots of useful stuff from the previous answers.