Seeding random number generators in parallel programs

问题

I am studing the multiprocessing module of Python. I have two cases:

Ex. 1

def Foo(nbr_iter):
    for step in xrange(int(nbr_iter)) :
        print random.uniform(0,1)
...

from multiprocessing import Pool

if __name__ == "__main__":
    ...
    pool = Pool(processes=nmr_parallel_block)
    pool.map(Foo, nbr_trial_per_process)

Ex 2. (using numpy)

 def Foo_np(nbr_iter):
     np.random.seed()
     print np.random.uniform(0,1,nbr_iter)

In both cases the random number generators are seeded in their forked processes.

Why do I have to do the seeding explicitly in the numpy example, but not in the Python example?

回答1:

If no seed is provided explicitly, numpy.random will seed itself using an OS-dependent source of randomness. Usually it will use /dev/urandom on Unix-based systems (or some Windows equivalent), but if this is not available for some reason then it will seed itself from the wall clock. Since self-seeding occurs at the time when a new subprocess forks, it is possible for multiple subprocesses to inherit the same seed if they forked at the same time, leading to identical random variates being produced by different subprocesses.

Often this correlates with the number of concurrent threads you are running. For example:

import numpy as np
import random
from multiprocessing import Pool

def Foo_np(seed=None):
    # np.random.seed(seed)
    return np.random.uniform(0, 1, 5)

pool = Pool(processes=8)
print np.array(pool.map(Foo_np, xrange(20)))

# [[ 0.14463001  0.80273208  0.5559258   0.55629762  0.78814652] <-
#  [ 0.14463001  0.80273208  0.5559258   0.55629762  0.78814652] <-
#  [ 0.14463001  0.80273208  0.5559258   0.55629762  0.78814652] <-
#  [ 0.14463001  0.80273208  0.5559258   0.55629762  0.78814652] <-
#  [ 0.14463001  0.80273208  0.5559258   0.55629762  0.78814652] <-
#  [ 0.14463001  0.80273208  0.5559258   0.55629762  0.78814652] <-
#  [ 0.14463001  0.80273208  0.5559258   0.55629762  0.78814652] <-
#  [ 0.64672339  0.99851749  0.8873984   0.42734339  0.67158796]
#  [ 0.64672339  0.99851749  0.8873984   0.42734339  0.67158796]
#  [ 0.64672339  0.99851749  0.8873984   0.42734339  0.67158796]
#  [ 0.64672339  0.99851749  0.8873984   0.42734339  0.67158796]
#  [ 0.64672339  0.99851749  0.8873984   0.42734339  0.67158796]
#  [ 0.11283279  0.28180632  0.28365286  0.51190168  0.62864241]
#  [ 0.11283279  0.28180632  0.28365286  0.51190168  0.62864241]
#  [ 0.28917586  0.40997875  0.06308188  0.71512199  0.47386047]
#  [ 0.11283279  0.28180632  0.28365286  0.51190168  0.62864241]
#  [ 0.64672339  0.99851749  0.8873984   0.42734339  0.67158796]
#  [ 0.11283279  0.28180632  0.28365286  0.51190168  0.62864241]
#  [ 0.14463001  0.80273208  0.5559258   0.55629762  0.78814652] <-
#  [ 0.11283279  0.28180632  0.28365286  0.51190168  0.62864241]]

You can see that groups of up to 8 threads simultaneously forked with the same seed, giving me identical random sequences (I've marked the first group with arrows).

Calling np.random.seed() within a subprocess forces the thread-local RNG instance to seed itself again from /dev/urandom or the wall clock, which will (probably) prevent you from seeing identical output from multiple subprocesses. Best practice is to explicitly pass a different seed (or numpy.random.RandomState instance) to each subprocess, e.g.:

def Foo_np(seed=None):
    local_state = np.random.RandomState(seed)
    print local_state.uniform(0, 1, 5)

pool.map(Foo_np, range(20))

I'm not entirely sure what underlies the differences between random and numpy.random in this respect (perhaps it has slightly different rules for selecting a source of randomness to self-seed with compared to numpy.random?). I would still recommend explicitly passing a seed or a random.Random instance to each subprocess to be on the safe side. You could also use the .jumpahead() method of random.Random which is designed for shuffling the states of Random instances in multithreaded programs.

回答2:

Here is a nice blog post that will explains the way numpy.random works.

If you use np.random.rand() it will takes the seed created when you imported the np.random module. So you need to create a new seed at each thread manually (cf examples in the blog post for example).

The python random module does not have this issue and automatically generates different seed for each thread.

回答3:

numpy 1.17 just introduced [quoting] "..three strategies implemented that can be used to produce repeatable pseudo-random numbers across multiple processes (local or distributed).."

the 1st strategy is using a SeedSequence object. There are many parent / child options there, but for our case, if you want the same generated random numbers, but different at each run:

(python3, printing 3 random numbers from 4 processes)

from numpy.random import SeedSequence, default_rng
from multiprocessing import Pool

def rng_mp(rng):
    return [ rng.random() for i in range(3) ]

seed_sequence = SeedSequence()
n_proc = 4
pool = Pool(processes=n_proc)
pool.map(rng_mp, [ default_rng(seed_sequence) for i in range(n_proc) ])

# 2 different runs
[[0.2825724770857644, 0.6465318335272593, 0.4620869345284885],
 [0.2825724770857644, 0.6465318335272593, 0.4620869345284885],
 [0.2825724770857644, 0.6465318335272593, 0.4620869345284885],
 [0.2825724770857644, 0.6465318335272593, 0.4620869345284885]]

[[0.04503760429109904, 0.2137916986051025, 0.8947678672387492],
 [0.04503760429109904, 0.2137916986051025, 0.8947678672387492],
 [0.04503760429109904, 0.2137916986051025, 0.8947678672387492],
 [0.04503760429109904, 0.2137916986051025, 0.8947678672387492]]

If you want the same result for reproducing purposes, you can simply reseed numpy with the same seed (17):

import numpy as np
from multiprocessing import Pool

def rng_mp(seed):
    np.random.seed(seed)
    return [ np.random.rand() for i in range(3) ]

n_proc = 4
pool = Pool(processes=n_proc)
pool.map(rng_mp, [17] * n_proc)

# same results each run:
[[0.2946650026871097, 0.5305867556052941, 0.19152078694749486],
 [0.2946650026871097, 0.5305867556052941, 0.19152078694749486],
 [0.2946650026871097, 0.5305867556052941, 0.19152078694749486],
 [0.2946650026871097, 0.5305867556052941, 0.19152078694749486]]

来源：https://stackoverflow.com/questions/29854398/seeding-random-number-generators-in-parallel-programs

标签

python

numpy

random

multiprocessing