Split data.table into roughly equal parts

抹茶落季 2021-01-20 13:59

To parallelize a task, I need to split a big data.table into roughly equal parts, keeping together the groups defined by a column, id. Suppose:

N

4 answers
  •  暖寄归人
    2021-01-20 14:25

    If k is big enough, you can use this idea to split data into groups:

    First, let's find the size of each id group:

    group_sizes <- dt[, .N, by = id]
    

    Then create two lists of length M: one to track the sizes in each group, and one to record which ids each group contains:

    grps_vals <- list()
    grps_vals[1 : M] <- c(0)
    
    grps_nms <- list()
    grps_nms[1 : M] <- c(0)
    

    (The zero placeholders are there only so that each list can be created with length M.)

    Then loop over the ids, adding each one to the group that is currently smallest. This keeps the groups roughly equal:

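    for (i in 1:nrow(group_sizes)) {
       sums <- sapply(grps_vals, sum)
       idx <- which(sums == min(sums))[1]  # smallest group so far (first one on ties)
       grps_vals[[idx]] <- c(grps_vals[[idx]], group_sizes$N[i])
       # record the id as well; as.character keeps factor ids from turning into integer codes
       grps_nms[[idx]] <- c(grps_nms[[idx]], as.character(group_sizes$id[i]))
    }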
    

    Finally, delete the placeholder zero at the start of each element in the list of names :)

    grps_nms <- lapply(grps_nms, function(x){x[-1]})
    
    > grps_nms
    [[1]]
    [1] "a" "d" "f"
    
    [[2]]
    [1] "b"
    
    [[3]]
    [1] "c" "e"
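
    The steps above stop at a list of ids per group. As a rough, self-contained sketch of how that assignment could be carried back to the rows, the snippet below uses made-up data (the six ids and their sizes are hypothetical), assumes M is the number of parts, and keeps running totals in a numeric vector instead of lists. The names totals, assigned and part are introduced here only for illustration.

    ```r
    library(data.table)

    # Hypothetical example data: six ids with uneven group sizes (100 rows).
    dt <- data.table(
      id = rep(c("a", "b", "c", "d", "e", "f"), times = c(5, 40, 20, 10, 15, 10)),
      x  = seq_len(100)
    )
    M <- 3  # assumed: the number of parts to create

    group_sizes <- dt[, .N, by = id]

    # Greedy assignment: each id goes to the part with the smallest running total.
    totals   <- numeric(M)
    assigned <- integer(nrow(group_sizes))
    for (i in seq_len(nrow(group_sizes))) {
      idx         <- which.min(totals)  # first minimum on ties
      assigned[i] <- idx
      totals[idx] <- totals[idx] + group_sizes$N[i]
    }
    group_sizes[, part := assigned]

    # Update join: copy the part label onto every row, then split by it.
    # All rows of a given id end up in the same part.
    dt[group_sizes, on = "id", part := i.part]
    parts <- split(dt, by = "part")
    ```

    With these sizes the three parts hold 30, 40 and 30 rows. Sorting group_sizes by decreasing N before the loop is a common variation that tends to balance the parts better when group sizes vary a lot; it is not part of the steps above.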
    
