I would like to efficiently draw a random sample by group from a data.table, but it should be possible to sample a different proportion for each group.
If I
You can use .GRP to index into the vector of sample fractions, but to ensure the correct group is matched you might want to define group_col as a factor variable.
group_sampler <- function(data, group_col, sample_fractions) {
  # Samples a fraction (between 0 and 1) of the rows from each group in a data.table.
  # inputs:
  #   data             - data.table
  #   group_col        - name of the column to group by
  #   sample_fractions - vector of values between 0 and 1, one per group, in the
  #                      (sorted) order in which keyby processes the groups
  stopifnot(length(sample_fractions) == uniqueN(data[[group_col]]))
  data[, .SD[sample(.N, ceiling(.N * sample_fractions[.GRP]))], keyby = group_col]
}
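As a quick check of the positional version, here is a minimal sketch on made-up data. The table DT, its columns a and v, and the fractions 0.25 and 0.5 are assumptions chosen for illustration; they are not from the question.

```r
library(data.table)

# Hypothetical example data: 100 rows in group "x", 40 rows in group "y"
DT <- data.table(a = rep(c("x", "y"), times = c(100, 40)), v = seq_len(140))

# Positional version of the sampler, as described above
group_sampler <- function(data, group_col, sample_fractions) {
  stopifnot(length(sample_fractions) == uniqueN(data[[group_col]]))
  data[, .SD[sample(.N, ceiling(.N * sample_fractions[.GRP]))], keyby = group_col]
}

# Fractions are matched by position: first sorted group ("x"), then "y"
res <- group_sampler(DT, "a", c(0.25, 0.5))

# Rows per group: ceiling(100 * 0.25) = 25 for "x", ceiling(40 * 0.5) = 20 for "y"
res[, .N, by = a]
```

Because the fractions are matched by position, swapping them to c(0.5, 0.25) would silently sample 50 rows from "x" and 10 from "y".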
Edit in response to chinsoon12's comment:
Instead of relying on the groups being in the correct order, it would be safer to make the last line of the function:
data[, .SD[sample(.N, ceiling(.N * sample_fractions[[as.character(unlist(.BY))]]))], keyby = group_col]
And then you pass sample_fractions as a named vector:
group_sampler(DT, 'a', sample_fractions = c(x = 0.1, y = 0.9))
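To illustrate the name-based lookup, here is a small self-contained sketch. The function name group_sampler_named, the example table, and the fractions are hypothetical, and two things are my additions rather than part of the answer: the extra check that every group has a named fraction, and the as.character() wrapper that guards against a factor grouping column being used as integer codes.

```r
library(data.table)

# Hypothetical example data: 100 rows in group "x", 40 rows in group "y"
DT <- data.table(a = rep(c("x", "y"), times = c(100, 40)), v = seq_len(140))

# Name-based variant: the fraction is looked up by the group's value, not its position
group_sampler_named <- function(data, group_col, sample_fractions) {
  groups <- as.character(unique(data[[group_col]]))
  # Assumed extra safety check: every group must have a named fraction
  stopifnot(!is.null(names(sample_fractions)),
            all(groups %in% names(sample_fractions)))
  data[, .SD[sample(.N, ceiling(.N * sample_fractions[[as.character(unlist(.BY))]]))],
       keyby = group_col]
}

# The names are deliberately given in "wrong" order; the lookup is by name
res_named <- group_sampler_named(DT, "a", sample_fractions = c(y = 0.5, x = 0.25))

# Rows per group: ceiling(100 * 0.25) = 25 for "x", ceiling(40 * 0.5) = 20 for "y"
res_named[, .N, by = a]
```

Note that unlist(.BY) only yields a single lookup key when grouping by one column, so this sketch does not cover multiple grouping columns.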