To parallelize a task, I need to split a big data.table into roughly equal parts, keeping together groups defined by a column, id. Suppose:

N is the number of rows, k is the number of distinct values of id, and M is the number of parts I want to split the data into.
If k is big enough relative to M, you can use a greedy approach to split the ids into groups.
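To make the steps below reproducible, here is a toy setup (the names dt, N, k, M and all the values are my own illustration; the ids a..f match the ones in the output at the end, though the exact grouping will depend on the simulated sizes):

library(data.table)

set.seed(42)                      # for reproducibility
N <- 1000                         # number of rows
k <- 6                            # number of distinct ids
M <- 3                            # number of parts wanted
dt <- data.table(id = sample(letters[1:k], N, replace = TRUE),
                 x  = rnorm(N))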
First, let's find the size of each id:
group_sizes <- dt[, .N, by = id]
Then create two lists of length M: one to track the total size of each group, the other to track which ids it contains.
grps_vals <- list()
grps_vals[1:M] <- c(0)
grps_nms <- list()
grps_nms[1:M] <- c(0)
(The zeros are placeholders: assigning them is what creates the lists with M elements up front.)
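An equivalent, more compact way to initialize them, if you prefer:

grps_vals <- rep(list(0), M)
grps_nms <- rep(list(0), M)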
Then loop over the ids and, on every iteration, add the next id to the currently smallest group. This keeps the groups roughly equal:
for (i in 1:nrow(group_sizes)) {
  sums <- sapply(grps_vals, sum)                             # current total size of each group
  idx <- which.min(sums)                                     # index of the smallest group
  grps_vals[[idx]] <- c(grps_vals[[idx]], group_sizes$N[i])  # add this id's size to it
  grps_nms[[idx]] <- c(grps_nms[[idx]], group_sizes$id[i])   # record which id went where
}
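An optional refinement: if the id sizes vary a lot, sorting the ids from largest to smallest before running the loop (the classic greedy / longest-processing-time heuristic) tends to balance the groups better:

setorder(group_sizes, -N)    # run this before the loop above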
Finally, delete the placeholder zero from the start of each name vector :)
grps_nms <- lapply(grps_nms, function(x){x[-1]})
> grps_nms
[[1]]
[1] "a" "d" "f"
[[2]]
[1] "b"
[[3]]
[1] "c" "e"
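With the id groups in hand, you can materialize the actual parts and process them in parallel. A minimal sketch, assuming the toy dt from above (my_task stands for whatever you run on each part):

parts <- lapply(grps_nms, function(ids) dt[id %in% ids])
# results <- parallel::mclapply(parts, my_task)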