Reproducible splitting of data into training and testing in R

Submitted by 本小妞迷上赌 on 2020-01-01 22:15:12

Question


A common way to sample or split data in R is to call sample(), e.g., on row numbers. For example:

library(data.table)
set.seed(1)

population <- as.character(1e5:(1e6-1))  # some made up ID names

N <- 1e4  # sample size

sample1 <- data.table(id = sort(sample(population, N)))  # randomly sample N ids
test <- sample(N-1, N/2, replace = FALSE)  # row numbers valid for both N and N-1 rows
test1 <- sample1[test, .(id)]

The problem is that this isn't robust to changes in the data. For example, if we drop just one observation:

sample2 <- sample1[-sample(N, 1)]  

Samples 1 and 2 are still all but identical:

nrow(merge(sample1, sample2))

[1] 9999

Yet the same row splitting yields very different test sets, even though we've set the seed:

test2 <- sample2[test, .(id)]
nrow(test1)

[1] 5000

nrow(merge(test1, test2))

[1] 2653
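
To see why the overlap collapses, here is a toy sketch with made-up ids (not the data above): dropping one row shifts every later row up by one position, so the same row numbers now point at different ids.

```r
# Toy illustration with hypothetical ids: fixed row positions vs. a changed table
ids <- letters[1:10]        # ten made-up ids, already sorted
idx <- c(2, 5, 8)           # row positions picked once as the "test" rows

before <- ids[idx]          # "b" "e" "h"
ids2   <- ids[-3]           # drop one observation ("c")
after  <- ids2[idx]         # "b" "f" "i"

intersect(before, after)    # "b": only the row before the dropped one is kept
```

Every test row located after the dropped observation changes identity, which is what happens at scale in the example above.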

One could sample specific IDs instead, but a fixed list of IDs would not be robust when observations are dropped or added.

What would be a way to make the split more robust to changes to the data? Namely, have the assignment to test unchanged for unchanged observations, not assign dropped observations, and reassign new observations?


Answer 1:


Use a hash function and split on the first hex digit of each id's hash, taken modulo m:

md5_bit_mod <- function(x, m = 2L) {
  # Inputs: 
  #  x: a character vector of ids
  #  m: the modulo divisor (modify for split proportions other than 50:50)
  # Output: remainders from dividing the first digit of the md5 hash of x by m
  as.integer(as.hexmode(substr(openssl::md5(x), 1, 1)) %% m)
}
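
A single hex digit takes only 16 values, so the remainders are evenly spread only when m divides 16 (2, 4, 8 or 16). For other proportions, such as 80:20, one possible approach is to map more hex digits to a number in [0, 1) and compare it with a threshold. The sketch below assumes the openssl package is installed; hash_unit() and hash_is_test() are hypothetical helper names, not part of the original answer.

```r
# Sketch only: hash_unit() and hash_is_test() are hypothetical helpers.
hash_unit <- function(x) {
  # First 7 hex digits of the md5 hash (28 bits), small enough for an R integer
  h <- substr(as.character(openssl::md5(x)), 1, 7)
  strtoi(h, base = 16L) / 16^7   # value in [0, 1)
}

hash_is_test <- function(x, p_test = 0.2) {
  # TRUE for ids whose hash falls below the requested test proportion
  hash_unit(x) < p_test
}

ids <- as.character(100000:100999)   # made-up ids
is_test <- hash_is_test(ids)
mean(is_test)                        # close to 0.2, but not exactly 0.2

# Dropping a row leaves the assignment of the remaining ids unchanged
identical(hash_is_test(ids[-1]), is_test[-1])
```

Because the assignment depends only on each id, it is the same regardless of which other rows are present.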

Hash splitting works better in this case because the test/train assignment is determined by the hash of each observation's id, not by its relative position in the data:

test1a <- sample1[md5_bit_mod(id) == 0L, .(id)]
test2a <- sample2[md5_bit_mod(id) == 0L, .(id)]

nrow(merge(test1a, test2a))

[1] 5057

nrow(test1a)

[1] 5057

The test set size is not exactly 5000 because assignment is probabilistic, but this shouldn't be a problem in large samples: by the law of large numbers, the test share converges to the target proportion.
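
As a rough check on how large the deviation is, one can treat each assignment as an independent fair coin flip. This is an assumption about how the hash behaves, not a guarantee. The test set size is then approximately binomial:

```r
n <- 10000          # rows in sample1
p <- 0.5            # assumed probability that an id lands in the test set
observed <- 5057    # test set size reported above

expected <- n * p                      # 5000
std_dev  <- sqrt(n * p * (1 - p))      # 50
z <- (observed - expected) / std_dev   # 1.14 standard deviations above 5000
```

Under that assumption, a deviation of 57 rows is well within ordinary sampling variation, and the relative deviation shrinks as n grows.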

See also: http://blog.richardweiss.org/2016/12/25/hash-splits.html and https://crypto.stackexchange.com/questions/20742/statistical-properties-of-hash-functions-when-calculating-modulo



Source: https://stackoverflow.com/questions/52769681/reproducible-splitting-of-data-into-training-and-testing-in-r
