To parallelize a task, I need to split a big data.table into roughly equal parts,
keeping together the groups defined by a column, id. Suppose:
N
Just an alternative approach using dplyr. Run the chained script step by step to visualise how the dataset changes through each step. It is a simple process.
library(data.table)
library(dplyr)
set.seed(1)
N <- 16 # in application N is very large
k <- 6 # in application k << N
dt <- data.table(id = sample(letters[1:k], N, replace = TRUE), value = runif(N)) %>%
  arrange(id)
dt %>%
  select(id) %>%
  distinct() %>% # select distinct id values
  mutate(group = ntile(id, 3)) %>% # create grouping (3 groups here)
  inner_join(dt, by = "id") # join back initial information
PS: I've learnt lots of useful stuff based on previous answers.
Preliminary comment
I recommend reading what the main author of data.table has to say about parallelization with it.
I don't know how familiar you are with data.table, but you may have overlooked its by argument. Quoting @eddi's comment from below:
Instead of literally splitting up the data - create a new "parallel.id" column, and then call
dt[, parallel_operation(.SD), by = parallel.id]
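To make the quoted suggestion concrete, here is a minimal sketch. It assumes M = 3 parts, and parallel_operation is only a hypothetical stand-in (it sums value and counts rows). The parallel.id mapping below just cycles through the distinct ids; any of the balancing schemes from the other answers could supply it instead.

```r
library(data.table)

set.seed(1)
N <- 16
k <- 6
M <- 3  # assumed number of parts
dt <- data.table(id = sample(letters[1:k], N, replace = TRUE), value = runif(N))

# Hypothetical stand-in for the real per-chunk work
parallel_operation <- function(sd) list(total = sum(sd$value), rows = nrow(sd))

# Map every distinct id to one of M chunks, so all rows of an id share a chunk
ids <- sort(unique(dt$id))
id_map <- data.table(id = ids, parallel.id = (seq_along(ids) - 1L) %% M + 1L)
dt[id_map, parallel.id := i.parallel.id, on = "id"]

# One call per chunk; .SD holds the rows of that chunk
res <- dt[, parallel_operation(.SD), by = parallel.id]
res
```

Note that the data is never physically split here: the chunk label is just another column, and by does the grouping.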
Answer, assuming you don't want to use by
Sort the IDs by size:
ids <- names(sort(table(dt$id)))
n <- length(ids)
Rearrange so that we alternate between big and small IDs, following Arun's interleaving trick:
alt_ids <- c(ids, rev(ids))[order(c(1:n, 1:n))][1:n]
Split the ids in order, with roughly the same number of IDs in each group (like zero323's answer):
gs <- split(alt_ids, ceiling(seq(n) / (n/M)))
res <- vector("list", M)
setkey(dt, id)
for (m in 1:M) res[[m]] <- dt[J(gs[[m]])]
# if using a data.frame, replace the last two lines with
# for (m in 1:M) res[[m]] <- dt[id %in% gs[[m]],]
Check that the sizes aren't too bad:
# using the OP's example data...
sapply(res, nrow)
# [1] 7 9 for M = 2
# [1] 5 5 6 for M = 3
# [1] 1 6 3 6 for M = 4
# [1] 1 4 2 3 6 for M = 5
Although I emphasized data.table at the top, this should work fine with a data.frame, too.
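In case the interleaving step in this answer looks cryptic, here is what it does on a toy vector. The five ids below are made up and assumed to be already sorted from the smallest group to the largest.

```r
# Toy ids, assumed already sorted from smallest to largest group
ids <- c("a", "b", "c", "d", "e")
n <- length(ids)

# Concatenate the vector with its reverse, then interleave the two halves:
# order(c(1:n, 1:n)) yields the positions 1, n+1, 2, n+2, ...
alt_ids <- c(ids, rev(ids))[order(c(1:n, 1:n))][1:n]
alt_ids
# [1] "a" "e" "b" "d" "c"
```

So the result alternates smallest, largest, second smallest, second largest, and so on, which is why consecutive runs of alt_ids tend to have similar total sizes.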
If the distribution of the ids is not pathologically skewed, the simplest approach would be something like this:
split(dt, as.numeric(as.factor(dt$id)) %% M)
It assigns each id to a bucket using the factor's integer code modulo the number of buckets.
For most applications this is good enough to get a relatively balanced distribution of data. You should be careful with input like time series, though. In such a case you can simply enforce a random order of levels when you create the factor. Choosing a prime number for M is a more robust approach, but most likely less practical.
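A rough sketch of the random-level-order variant, using a plain data.frame with made-up, time-series-like ids. M = 3 and the toy data are assumptions for illustration only.

```r
set.seed(42)
M <- 3  # assumed number of buckets

# Toy data whose ids arrive in a regular, time-series-like order
df <- data.frame(id = rep(sprintf("t%02d", 1:12), each = 2), value = runif(24))

# Shuffle the factor levels so the bucket no longer follows the order of the ids
f <- factor(df$id, levels = sample(unique(df$id)))
buckets <- split(df, as.integer(f) %% M)

# Rows per bucket
sapply(buckets, nrow)
```

Each id still lands in exactly one bucket, because the bucket depends only on the id's (now shuffled) level code.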
If k is big enough, you can use this idea to split the data into groups:
First, let's find the size of each id:
group_sizes <- dt[, .N, by = id]
Then create two lists of length M: one to track the sizes accumulated in each group, the other to record which ids each group contains:
grps_vals <- list()
grps_vals[1 : M] <- c(0)
grps_nms <- list()
grps_nms[1 : M] <- c(0)
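(Here I deliberately added zero values so that the lists have length M.)
Then loop over the ids, on every iteration adding the next id to the currently smallest group. This keeps the groups roughly equal:
for (i in 1:nrow(group_sizes)) {
  sums <- sapply(grps_vals, sum) # current total size of each group
  idx <- which(sums == min(sums))[1] # first of the smallest groups
  grps_vals[[idx]] <- c(grps_vals[[idx]], group_sizes$N[i])
  grps_nms[[idx]] <- c(grps_nms[[idx]], group_sizes$id[i]) # remember which id went where
}
Finally, delete the leading zero element from each list of names :)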
grps_nms <- lapply(grps_nms, function(x){x[-1]})
> grps_nms
[[1]]
[1] "a" "d" "f"
[[2]]
[1] "b"
[[3]]
[1] "c" "e"
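The same greedy idea can be wrapped in a small base-R function. This is only a sketch: greedy_partition is a name made up here, the sizes are hypothetical (not the OP's data), and preallocating the containers avoids the dummy zero elements.

```r
# Assign each id to whichever of the M groups currently holds the fewest rows.
# `sizes` is a named vector: names are ids, values are row counts.
greedy_partition <- function(sizes, M) {
  totals <- numeric(M)          # running row count per group
  members <- vector("list", M)  # ids assigned to each group
  for (i in seq_along(sizes)) {
    idx <- which.min(totals)    # first of the smallest groups
    totals[idx] <- totals[idx] + sizes[[i]]
    members[[idx]] <- c(members[[idx]], names(sizes)[i])
  }
  list(members = members, totals = totals)
}

# Hypothetical group sizes, for illustration only
sizes <- c(a = 4L, b = 5L, c = 3L, d = 1L, e = 2L, f = 1L)
out <- greedy_partition(sizes, M = 3)
out$members  # group 1: "a" "e"; group 2: "b"; group 3: "c" "d" "f"
out$totals
# [1] 6 5 5
```

With the real data, sizes could be built from group_sizes, e.g. setNames(group_sizes$N, group_sizes$id).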