Isolate randomness of a local environment from the global R process

故事扮演 提交于 2019-12-22 04:15:17

问题


We can use set.seed() to set a random seed in R, and this has a global effect. Here is a minimal example to illustrate my goal:

set.seed(0)
runif(1)
# [1] 0.8966972

set.seed(0)
f <- function() {
  # I do not want this random number to be affected by the global seed
  runif(1)
}
f()
# [1] 0.8966972

Basically I want to be able to avoid the effect of the global random seed (i.e., .Random.seed) in a local environment, such as an R function, so that I can achieve some sort of randomness over which the user has no control. For example, even if the user has set.seed(), he will still get different output every time he calls this function.

Now there are two implementations. The first one relies on set.seed(NULL) to let R re-initialize the random seed every time I want to get some random numbers:

createUniqueId <- function(bytes) {
  withPrivateSeed(
    paste(as.hexmode(sample(256, bytes, replace = TRUE) - 1), collapse = "")
  )
}
withPrivateSeed <- function(expr, seed = NULL) {
  oldSeed <- if (exists('.Random.seed', envir = .GlobalEnv, inherits = FALSE)) {
    get('.Random.seed', envir = .GlobalEnv, inherits = FALSE)
  }
  if (!is.null(oldSeed)) {
    on.exit(assign('.Random.seed', oldSeed, envir = .GlobalEnv), add = TRUE)
  }
  set.seed(seed)
  expr
}

You can see I get different id strings even if I set the seed to 0, and the global random number stream is still reproducible:

> set.seed(0)
> runif(3)
[1] 0.8966972 0.2655087 0.3721239
> createUniqueId(4)
[1] "83a18600"
> runif(3)
[1] 0.5728534 0.9082078 0.2016819

> set.seed(0)
> runif(3)  # same
[1] 0.8966972 0.2655087 0.3721239
> createUniqueId(4)  # different
[1] "77cb3d91"
> runif(3)
[1] 0.5728534 0.9082078 0.2016819

> set.seed(0)
> runif(3)
[1] 0.8966972 0.2655087 0.3721239
> createUniqueId(4)
[1] "c41d61d8"
> runif(3)
[1] 0.5728534 0.9082078 0.2016819

The second implementation can be found here on Github. It is more complicated, and the basic idea is:

  • initialize the random seed during package startup using set.seed(NULL) (in .onLoad())
  • store the random seed in a separate environment (.globals$ownSeed)
  • each time when we want to generate random numbers:
    1. assign the local seed to the global random seed
    2. generate random numbers
    3. assign the new global seed (it has changed due to step 2) to the local seed
    4. restore the global seed to its original value

Now my question is if the two approaches are equivalent in theory. The randomness of first approach relies on the current time and process ID when createUniqueId() is called, and the second approach seems to rely on the time and process ID when the package is loaded. For the first approach, is it possible that two calls of createUniqueId() happen exactly at the same time in the same R process so that they return the same id string?

Update

In the answer below, Robert Krzyzanowski provided some empirical evidence that set.seed(NULL) can lead to serious ID collisions. I did a simple visualization for it:

createGlobalUniqueId <- function(bytes) {
  paste(as.hexmode(sample(256, bytes, replace = TRUE) - 1), collapse = "")
}
n <- 10000
length(unique(replicate(n, createGlobalUniqueId(5))))
length(unique(x <- replicate(n, createUniqueId(5))))
# denote duplicated values by 1, and unique ones by 0
png('rng-time.png', width = 4000, height = 400)
par(mar = c(4, 4, .1, .1), xaxs = 'i')
plot(1:n, duplicated(x), type = 'l')
dev.off()

When the line reaches the top of the plot, that means there is a duplicate value generated. However, note these duplicates do not come successively, i.e. any(x[-1] == x[-n]) is normally FALSE. There might be a pattern for the duplication associated with the system time. I'm not able to investigate further due to my lack of understanding of how the time-based random seed works, but you can see the relevant pieces of C source code here and here.


回答1:


I thought it would be nice to have just an independent RNG inside your function, that is not affected by the global seed, but would have its own seed. Turns out, randtoolbox offers this functionality:

library(randtoolbox)
replicate(3, {
  set.seed(1)
  c(runif(1), WELL(3), runif(1))
})   
#            [,1]      [,2]      [,3]
#[1,] 0.265508663 0.2655087 0.2655087
#[2,] 0.481195594 0.3999952 0.9474923
#[3,] 0.003865934 0.6596869 0.4684255
#[4,] 0.484556709 0.9923884 0.1153879
#[5,] 0.372123900 0.3721239 0.3721239

Top and bottom rows are affected by the seed, whereas middle ones are "truly random".

Based on that, here's the implementation of your function:

sample_WELL <- function(n, size=n) {
  findInterval(WELL(size), 0:n/n)
}

createUniqueId_WELL <- function(bytes) {
  paste(as.hexmode(sample_WELL(256, bytes) - 1), collapse = "")
}

length(unique(replicate(10000, createUniqueId_WELL(5))))
#[1] 10000

# independency on the seed: 
set.seed(1)
x <- replicate(100, createGlobalUniqueId(5))
x_WELL <- replicate(100, createUniqueId_WELL(5))
set.seed(1)
y <- replicate(100, createGlobalUniqueId(5))
y_WELL <- replicate(100, createUniqueId_WELL(5))
sum(x==y)
#[1] 100
sum(x_WELL==y_WELL)
#[1] 0

Edit

To understand why we get duplicated keys, we should take a look what happens when we call set.seed(NULL). All RNG-related code is written in C, so we should go directly to svn.r-project.org/R/trunk/src/main/RNG.c and refer to the function do_setseed. If seed = NULL then clearly TimeToSeed is called. There's a comment that states it should be located in datetime.c, however, it can be found in svn.r-project.org/R/trunk/src/main/times.c.

Navigating the R source can be difficult, so I'm pasting the function here:

/* For RNG.c, main.c, mkdtemp.c */
attribute_hidden
unsigned int TimeToSeed(void)
{
    unsigned int seed, pid = getpid();
#if defined(HAVE_CLOCK_GETTIME) && defined(CLOCK_REALTIME)
    {
    struct timespec tp;
    clock_gettime(CLOCK_REALTIME, &tp);
    seed = (unsigned int)(((uint_least64_t) tp.tv_nsec << 16) ^ tp.tv_sec);
    }
#elif defined(HAVE_GETTIMEOFDAY)
    {
    struct timeval tv;
    gettimeofday (&tv, NULL);
    seed = (unsigned int)(((uint_least64_t) tv.tv_usec << 16) ^ tv.tv_sec);
    }
#else
    /* C89, so must work */
    seed = (Int32) time(NULL);
#endif
    seed ^= (pid <<16);
    return seed;
}

So each time we call set.seed(NULL), R does these steps:

  1. Takes current time in seconds and nanoseconds (if possible, platform dependency here in #if defined blocks)
  2. Applies bit shift to nanoseconds and bit 'xor'es result with seconds
  3. Applies bit shift to pid and bit 'xor'es it with the previous result
  4. Sets the result as a new seed

Well, now it's almost obvious that we get duplicated values when the resulting seeds collide. My guess is this happens when two calls fall within 1 second, so that tv_sec is constant. To confirm that, I'm introducing a lag:

createUniqueIdWithLag <- function(bytes, lag) {
  Sys.sleep(lag)
  createUniqueId(bytes)
}
lags <- 1 / 10 ^ (1:5)
sapply(lags, function(x) length(unique(replicate(n, createUniqueIdWithLag(5, x)))))
[1] 1000 1000  996  992  990

What's confusing is that even the lag is substantial compared to nanoseconds, we still get collisions! Let's dig it further then, I wrote a "debugging emulator" for the seed:

emulate_seed <- function() {
  tv <- as.numeric(system('echo $(($(date +%s%N)))', intern = TRUE))
  pid <- Sys.getpid()
  tv_nsec <- tv %% 1e9
  tv_sec <- tv %/% 1e9
  seed <- bitwXor(bitwShiftL(tv_nsec, 16), tv_sec)
  seed <- bitwXor(bitwShiftL(pid, 16), seed)
  c(seed, tv_nsec, tv_sec, pid)
}

z <- replicate(1000, emulate_seed())
sapply(1:4, function(i) length(unique(z[i, ])))
# unique seeds, nanosecs, secs, pids:
#[1]  941 1000   36    1

That is really confusing: nanoseconds are all unique, and that does not guarantee uniqueness of the final seed. To demonstrate that effect, here's one of the duplicates:

#            [,1]        [,2] 
#[1,] -1654969360 -1654969360
#[2,]   135644672   962643456
#[3,]  1397894128  1397894128 
#[4,]        2057        2057
bitwShiftL(135644672, 16)
#[1] -973078528
bitwShiftL(962643456, 16)
#[1] -973078528

The final note: the binary representation of these two numbers and the shift is

00001000000101011100011000000000 << 16 => 1100011000000000 + 16 zeroes
00111001011000001100011000000000 << 16 => 1100011000000000 + 16 zeroes

So yes, this is really an unwanted collision.

Well, after all that, the final conclusion is: set.seed(NULL) is vulnerable to high load and does not guarantee the absence of collisions when dealing with multiple consecutive calls!




回答2:


For the first approach, it does indeed seem to be possible that two calls of createUniqueId() happen at exactly the same time in the same R process and return the same ID string.

length(unique(sapply(seq_len(100000), function(.) createUniqueId(5))))
# [1] 93906
createGlobalUniqueId <- function(bytes) paste(as.hexmode(sample(256, bytes, replace = TRUE) - 1), collapse = "")
length(unique(sapply(seq_len(100000), function(.) createGlobalUniqueId(5))))
# [1] 100000

Therefore, I would go with the second approach if you do not want ID collisions.



来源:https://stackoverflow.com/questions/23090958/isolate-randomness-of-a-local-environment-from-the-global-r-process

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!