Split data.table into roughly equal parts


To parallelize a task, I need to split a big data.table into roughly equal parts, keeping together the groups defined by a column, id. Suppose:

N


4 Answers

一生所求

2021-01-20 14:12

Just an alternative approach using dplyr. Run the chained script step by step to visualise how the dataset changes through each step. It is a simple process.

    library(data.table)
    library(dplyr)

    set.seed(1)
    N <- 16 # in application N is very large
    k <- 6  # in application k << N
    dt <- data.table(id = sample(letters[1:k], N, replace = TRUE), value = runif(N)) %>%
      arrange(id)

    dt %>%
      select(id) %>%
      distinct() %>%                    # select distinct id values
      mutate(group = ntile(id, 3)) %>%  # create grouping (3 groups here)
      inner_join(dt, by = "id")         # join back initial information

PS: I've learnt lots of useful stuff based on previous answers.


温柔的废话

2021-01-20 14:13
Preliminary comment

I recommend reading what the main author of data.table has to say about parallelization with it.

I don't know how familiar you are with data.table, but you may have overlooked its by argument. Quoting @eddi's comment from below:
"Instead of literally splitting up the data - create a new "parallel.id" column, and then call"
```
dt[, parallel_operation(.SD), by = parallel.id] 
```
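A minimal sketch of how such a column could be built, under stated assumptions: M is the number of parts, `parallel_operation()` is a hypothetical placeholder for the real work, and the round-robin mapping of ids to parts is just one simple choice, not something the quoted comment prescribes.

```r
library(data.table)

set.seed(1)
M  <- 3                                    # number of parts (assumed)
dt <- data.table(id = sample(letters[1:6], 16, replace = TRUE),
                 value = runif(16))

# Hypothetical stand-in for the real per-part computation
parallel_operation <- function(d) list(n_rows = nrow(d), total = sum(d$value))

# Map each distinct id to one of M parts, so rows sharing an id stay together
id_map <- data.table(id = unique(dt$id))
id_map[, parallel.id := (seq_len(.N) - 1L) %% M + 1L]

# Update join: copy the part label onto every row of dt by id
dt[id_map, parallel.id := i.parallel.id, on = "id"]

res <- dt[, parallel_operation(.SD), by = parallel.id]
```

Because the label is assigned per id rather than per row, no id is ever spread across two parts. How well the parts balance depends on the group sizes.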
Answer, assuming you don't want to use by

Sort the IDs by size:
```
ids   <- names(sort(table(dt$id)))
n     <- length(ids)
```
Rearrange so that we alternate between big and small IDs, following Arun's interleaving trick:
```
alt_ids <- c(ids, rev(ids))[order(c(1:n, 1:n))][1:n]
```
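To see what the interleaving does, here is a small worked example on made-up ids (the names s1 to s5 are hypothetical and assumed to be already sorted from smallest to largest group):

```r
# Toy ids, assumed sorted from smallest group (s1) to largest group (s5)
ids <- c("s1", "s2", "s3", "s4", "s5")
n   <- length(ids)

# Interleave ids with its reverse, then keep only the first n entries
alt_ids <- c(ids, rev(ids))[order(c(1:n, 1:n))][1:n]
alt_ids
# [1] "s1" "s5" "s2" "s4" "s3"
```

Small and large groups now sit next to each other, so consecutive chunks of alt_ids tend to have similar total sizes.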
Split the ids in order, with roughly the same number of IDs in each group (like zero323's answer). Here M is the number of parts you want:
```
gs  <- split(alt_ids, ceiling(seq(n) / (n/M)))

res <- vector("list", M)
setkey(dt, id)
for (m in 1:M) res[[m]] <- dt[J(gs[[m]])] 
# if using a data.frame, replace the last two lines with
# for (m in 1:M) res[[m]] <- dt[id %in% gs[[m]],] 
```
Check that the sizes aren't too bad:
```
# using the OP's example data...

sapply(res, nrow)
# [1] 7 9              for M = 2
# [1] 5 5 6            for M = 3
# [1] 1 6 3 6          for M = 4
# [1] 1 4 2 3 6        for M = 5
```
Although I emphasized data.table at the top, this should work fine with a data.frame, too.
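The chunk labels above come from dividing by n/M, which is floating-point arithmetic. As a precaution (not a claimed bug in the original), the same kind of labelling can be sketched with integer arithmetic only. The helper name `chunk_labels()` is hypothetical, and the resulting chunks may differ slightly from the ones the ceiling expression produces.

```r
# Hypothetical helper: label positions 1..n with chunk numbers 1..M
# using integer arithmetic only
chunk_labels <- function(n, M) {
  ((seq_len(n) - 1L) * as.integer(M)) %/% as.integer(n) + 1L
}

labels <- chunk_labels(7, 3)

# Chunk sizes differ by at most one
sizes <- as.integer(table(labels))
```

It would be used in place of the ceiling expression, for example `split(alt_ids, chunk_labels(n, M))`.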
情书的邮戳

2021-01-20 14:25
If the distribution of the ids is not pathologically skewed, the simplest approach would be something like this:
```
split(dt, as.numeric(as.factor(dt$id)) %% M)
```
It assigns each id to a bucket using the factor's integer code modulo the number of buckets.

For most applications this is good enough to get a relatively balanced distribution of data. You should be careful with input like time series, though. In such a case you can simply enforce a random order of levels when you create the factor. Choosing a prime number for M is a more robust approach, but most likely less practical.
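A minimal sketch of the random-level-order variant, using a plain data.frame with made-up data. The seed, the value of M and the column names are assumptions for illustration.

```r
set.seed(42)
M  <- 3
df <- data.frame(id = rep(c("a", "b", "c", "d", "e", "f"),
                          times = c(5, 1, 4, 2, 3, 1)),
                 stringsAsFactors = FALSE)
df$value <- runif(nrow(df))

# Shuffle the level order so the bucket does not follow the sort order of id
lev    <- sample(unique(df$id))
bucket <- as.integer(factor(df$id, levels = lev)) %% M

parts <- split(df, bucket)
```

Each id still lands in exactly one bucket. Only the assignment of ids to buckets changes from one shuffle to the next.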
暖寄归人

2021-01-20 14:25
If k is big enough, you can use this idea to split the data into groups:

First, let's find the size of each id's group:
```
group_sizes <- dt[, .N, by = id]
```
Then create two lists of length M: one to track the sizes collected in each group, and one to track which ids each group contains:
```
grps_vals <- list()
grps_vals[1 : M] <- c(0)

grps_nms <- list()
grps_nms[1 : M] <- c(0)
```
(Here I deliberately added zero values so that each list has length M.)

Then, on every loop iteration, add the next id to the currently smallest group. This keeps the groups roughly equal:
```
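for (i in 1:nrow(group_sizes)) {
   sums <- sapply(grps_vals, sum)            # current total size of each group
   idx <- which(sums == min(sums))[1]        # first of the smallest groups
   grps_vals[[idx]] <- c(grps_vals[[idx]], group_sizes$N[i])
   grps_nms[[idx]] <- c(grps_nms[[idx]], group_sizes$id[i])  # remember the id
}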
```
Finally, delete the leading zero element from each list of names :)
```
grps_nms <- lapply(grps_nms, function(x){x[-1]})

> grps_nms
[[1]]
[1] "a" "d" "f"

[[2]]
[1] "b"

[[3]]
[1] "c" "e"
```
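The same greedy idea can be sketched as a small self-contained function. Note that this version first sorts the ids from largest to smallest, which is an added variation and not part of the answer above. The function name `greedy_groups()` and the sizes are made up for illustration.

```r
# Hypothetical helper: assign ids to M groups, always adding the next id
# to the group with the smallest running total
greedy_groups <- function(sizes, M) {
  sizes  <- sort(sizes, decreasing = TRUE)   # largest first (added variation)
  totals <- numeric(M)                       # running row count per group
  assign <- integer(length(sizes))
  for (i in seq_along(sizes)) {
    idx         <- which.min(totals)         # first of the smallest groups
    assign[i]   <- idx
    totals[idx] <- totals[idx] + sizes[[i]]
  }
  split(names(sizes), factor(assign, levels = seq_len(M)))
}

sizes <- c(a = 5, b = 1, c = 4, d = 2, e = 3, f = 1)   # made-up rows per id
grps  <- greedy_groups(sizes, M = 3)
group_totals <- sapply(grps, function(ids) sum(sizes[ids]))
```

With a real table, the sizes vector could be built from group_sizes, for example with `setNames(group_sizes$N, group_sizes$id)`.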