Split data.table into roughly equal parts

抹茶落季 2021-01-20 13:59

To parallelize a task, I need to split a big data.table into roughly equal parts, keeping together the groups defined by a column, id. Suppose:

N

4 answers
  •  一生所求
    2021-01-20 14:12

    Just an alternative approach using dplyr. Run the chained script step by step to see how the dataset changes at each step. The grouping is computed on the distinct id values, so all rows with the same id end up in the same group.

        library(data.table)
        library(dplyr)

        set.seed(1)
        N <- 16 # in application N is very large
        k <- 6  # in application k << N
        dt <- data.table(id = sample(letters[1:k], N, replace = TRUE), value = runif(N)) %>%
          arrange(id)

        dt %>%
          select(id) %>%
          distinct() %>%                    # select distinct id values
          mutate(group = ntile(id, 3)) %>%  # assign each id to one of 3 groups
          inner_join(dt, by = "id")         # join back initial information
    

    PS: I've learnt lots of useful stuff based on previous answers.
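
Note that the `ntile` step in the answer above balances the number of distinct ids per group, not the number of rows, so the parts can be uneven when some ids are much larger than others. The following is a minimal sketch of one possible alternative, not the answerer's method: it assumes a greedy "largest group first" assignment in plain data.table. The names `n_parts`, `sizes`, `part_rows` and `part` are hypothetical and chosen only for illustration.

```r
library(data.table)

set.seed(1)
N <- 16       # toy size; in the question N is very large
k <- 6        # number of distinct ids, k << N
n_parts <- 3  # assumed number of parts (e.g. parallel workers)

dt <- data.table(id = sample(letters[1:k], N, replace = TRUE), value = runif(N))

# Count rows per id and sort so the largest groups are placed first
sizes <- dt[, .(n_rows = .N), by = id][order(-n_rows)]
sizes[, part := NA_integer_]

# Greedy assignment: each id goes to the part holding the fewest rows so far
part_rows <- numeric(n_parts)
for (i in seq_len(nrow(sizes))) {
  p <- which.min(part_rows)
  set(sizes, i, "part", p)
  part_rows[p] <- part_rows[p] + sizes$n_rows[i]
}

# Copy the part label onto every row (update join), then split into a list
dt[sizes, part := i.part, on = "id"]
parts <- split(dt, by = "part")
```

Each element of `parts` is a data.table that could be handed to one worker. The greedy rule does not guarantee an optimal balance, but every id stays within a single part and the row counts are usually close when k is much larger than the number of parts.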
