Reproducible splitting of data into training and testing in R

本秂侑毒 提交于 2019-12-04 22:26:00

Use a hash function and sample on the mod of its last digit:

md5_bit_mod <- function(x, m = 2L) {
  # Inputs: 
  #  x: a character vector of ids
  #  m: the modulo divisor (modify for split proportions other than 50:50)
  # Output: remainders from dividing the first digit of the md5 hash of x by m
  as.integer(as.hexmode(substr(openssl::md5(x), 1, 1)) %% m)
}

hash splitting works better in this case, because the assignment of test/train is determined by the hash of each obs., and not by its relative location in the data

test1a <- sample1[md5_bit_mod(id) == 0L, .(id)]
test2a <- sample2[md5_bit_mod(id) == 0L, .(id)]

nrow(merge(test1a, test2a))

[1] 5057

nrow(test1a)

[1] 5057

sample size is not exactly 5000 because assignment is probabilistic, but it shouldn't be a problem in large samples thanks to the law of large numbers.

See also: http://blog.richardweiss.org/2016/12/25/hash-splits.html and https://crypto.stackexchange.com/questions/20742/statistical-properties-of-hash-functions-when-calculating-modulo

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!