Split data.table into roughly equal parts

抹茶落季 2021-01-20 13:59

To parallelize a task, I need to split a big data.table into roughly equal parts, keeping together the groups defined by a column, id. Suppose:

N

4 answers
  • 2021-01-20 14:12

    Just an alternative approach using dplyr. Run the chained script step by step to visualise how the dataset changes through each step. It is a simple process.

    library(data.table)
    library(dplyr)

    set.seed(1)
    N <- 16 # in application N is very large
    k <- 6  # in application k << N
    dt <- data.table(id = sample(letters[1:k], N, replace = TRUE), value = runif(N)) %>%
      arrange(id)

    dt %>%
      select(id) %>%
      distinct() %>%                    # select distinct id values
      mutate(group = ntile(id, 3)) %>%  # create grouping
      inner_join(dt, by = "id")         # join back initial information
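
One thing to keep in mind with the pipeline above: ntile() balances the number of distinct ids per group, not the number of rows. A minimal sketch for checking the resulting row counts, reusing the same example data as a plain data.frame (the names df, grouped and sizes are introduced here only for illustration):

```r
library(dplyr)

set.seed(1)
df <- data.frame(id = sample(letters[1:6], 16, replace = TRUE),
                 value = runif(16))

# Same steps as above: one group label per distinct id, joined back to the rows.
grouped <- df %>%
  distinct(id) %>%
  mutate(group = ntile(id, 3)) %>%
  inner_join(df, by = "id")

# Rows per group; these can differ a lot when some ids are much larger than others.
sizes <- count(grouped, group)
print(sizes)
```

If the row counts turn out too uneven for the real data, a size-aware assignment is probably a better fit than ntile() on the id values.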
    

    PS: I've learnt lots of useful stuff based on previous answers.

  • 2021-01-20 14:13

    Preliminary comment

    I recommend reading what the main author of data.table has to say about parallelization with it.

    I don't know how familiar you are with data.table, but you may have overlooked its by argument...? Quoting @eddi's comment from below...

    Instead of literally splitting up the data - create a new "parallel.id" column, and then call

    dt[, parallel_operation(.SD), by = parallel.id] 
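
A minimal sketch of that idea, assuming a hypothetical parallel_operation() and a simple round-robin mapping from ids to M chunks. Both are illustrative choices, not part of the quoted comment:

```r
library(data.table)

set.seed(1)
dt <- data.table(id = sample(letters[1:6], 16, replace = TRUE), value = runif(16))
M  <- 3  # assumed number of chunks

# Map each distinct id to a chunk label in 1..M; all rows of one id share a label.
ids <- sort(unique(dt$id))
dt[, parallel.id := (match(id, ids) - 1L) %% M + 1L]

# Hypothetical stand-in for the real per-chunk work.
parallel_operation <- function(sd) {
  data.table(rows = nrow(sd), total = sum(sd$value))
}

res <- dt[, parallel_operation(.SD), by = parallel.id]
print(res)
```

The round-robin mapping ignores group sizes; any other id-to-chunk mapping can be dropped in, as long as every row of a given id gets the same parallel.id.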
    

    Answer, assuming you don't want to use by

    Sort the IDs by size:

    ids   <- names(sort(table(dt$id)))
    n     <- length(ids)
    

    Rearrange so that we alternate between big and small IDs, following Arun's interleaving trick:

    alt_ids <- c(ids, rev(ids))[order(c(1:n, 1:n))][1:n]
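
To see what the interleaving line does, here is a small worked example on six made-up ids that are assumed to be already sorted from the smallest to the largest group:

```r
# Six hypothetical ids, already sorted from smallest to largest group.
ids <- c("a", "b", "c", "d", "e", "f")
n   <- length(ids)

# c(ids, rev(ids)) is a b c d e f f e d c b a; ordering by c(1:n, 1:n)
# interleaves the two halves, and [1:n] keeps the first n entries.
alt_ids <- c(ids, rev(ids))[order(c(1:n, 1:n))][1:n]
print(alt_ids)
# [1] "a" "f" "b" "e" "c" "d"
```

Small and large ids now alternate, so consecutive runs of alt_ids tend to mix light and heavy groups.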
    

    Split the ids in order, with roughly the same number of IDs in each group (like zero323's answer):

    gs  <- split(alt_ids, ceiling(seq(n) / (n/M)))
    
    res <- vector("list", M)
    setkey(dt, id)
    for (m in 1:M) res[[m]] <- dt[J(gs[[m]])] 
    # if using a data.frame, replace the last two lines with
    # for (m in 1:M) res[[m]] <- dt[id %in% gs[[m]],] 
    

    Check that the sizes aren't too bad:

    # using the OP's example data...
    
    sapply(res, nrow)
    # [1] 7 9              for M = 2
    # [1] 5 5 6            for M = 3
    # [1] 1 6 3 6          for M = 4
    # [1] 1 4 2 3 6        for M = 5
    

    Although I emphasized data.table at the top, this should work fine with a data.frame, too.

  • 2021-01-20 14:25

    If the distribution of the ids is not pathologically skewed, the simplest approach would be something like this:

    split(dt, as.numeric(as.factor(dt$id)) %% M)
    

    It assigns each id to a bucket using its factor value modulo the number of buckets.

    For most applications this is good enough to get a relatively balanced distribution of data. You should be careful with input like time series, though. In such a case you can simply enforce a random order of levels when you create the factor. Choosing a prime number for M is a more robust approach, but most likely less practical.
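
A minimal sketch of the random-level-order variant, assuming the example data from the other answers and M = 3 (the names lev, bucket and parts are introduced here for illustration):

```r
library(data.table)

set.seed(1)
dt <- data.table(id = sample(letters[1:6], 16, replace = TRUE), value = runif(16))
M  <- 3  # assumed number of buckets

# Shuffle the distinct ids so bucket membership does not follow their sort order.
lev    <- sample(unique(dt$id))
bucket <- as.integer(factor(dt$id, levels = lev)) %% M

parts <- split(dt, bucket)
print(sapply(parts, nrow))
```

Every row of a given id still lands in the same bucket, because the bucket depends only on the id's position in the shuffled level order.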

  • 2021-01-20 14:25

    If k is big enough, you can use this idea to split the data into groups:

    First, let's find the size of each id:

    group_sizes <- dt[, .N, by = id]
    

    Then create two lists of length M: one to track the sizes collected in each group, and one to record which ids each group contains.

    grps_vals <- list()
    grps_vals[1 : M] <- c(0)
    
    grps_nms <- list()
    grps_nms[1 : M] <- c(0)
    

    (Here I deliberately added zero values so that each list has length M.)

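    Then loop over the ids, on every iteration adding the current id to the group with the smallest total so far. This keeps the groups roughly equal:

    for (i in seq_len(nrow(group_sizes))) {
       sums <- sapply(grps_vals, sum)          # current total size of each group
       idx <- which(sums == min(sums))[1]      # first group with the smallest total
       grps_vals[[idx]] <- c(grps_vals[[idx]], group_sizes$N[i])
       grps_nms[[idx]] <- c(grps_nms[[idx]], group_sizes$id[i])
    }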
    

    Finally, delete the first zero element from each list of names :)

    grps_nms <- lapply(grps_nms, function(x){x[-1]})
    
    > grps_nms
    [[1]]
    [1] "a" "d" "f"
    
    [[2]]
    [1] "b"
    
    [[3]]
    [1] "c" "e"
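
The same greedy idea can be wrapped in a function. The sketch below is one possible packaging, not the answer's exact code: assign_bins() is a hypothetical helper, and it processes ids from largest to smallest, which is a variation that the steps above do not use.

```r
library(data.table)

# Hypothetical helper: assign each id to one of M bins, always filling the
# bin with the fewest rows so far. Ids are processed largest first.
assign_bins <- function(dt, M) {
  sizes <- dt[, .N, by = id][order(-N)]
  load  <- numeric(M)                # rows assigned to each bin so far
  bin   <- integer(nrow(sizes))      # bin chosen for each id
  for (i in seq_len(nrow(sizes))) {
    j       <- which.min(load)       # first bin with the smallest load
    bin[i]  <- j
    load[j] <- load[j] + sizes$N[i]
  }
  data.table(id = sizes$id, N = sizes$N, bin = bin)
}

set.seed(1)
dt <- data.table(id = sample(letters[1:6], 16, replace = TRUE), value = runif(16))

bins  <- assign_bins(dt, M = 3)
dt[bins, on = "id", bin := i.bin]    # copy the bin label onto every row
parts <- split(dt, by = "bin")
print(sapply(parts, nrow))
```

With this scheme the gap between the fullest and emptiest bin should never exceed the size of the largest single id, which is usually acceptable when k is much larger than M.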
    