I am attempting to multithread some code with OMP. Currently my sequentially version using rand() to generate a set of random numbers with a consistent seed so that they ret
There seem to be a couple problems getting conflated here.
First, the non-thread-safe nature of rand()
function means that calling rand()
at the same time from different threads might yield different values than if it were called sequentially. It's probably easiest to explain this with a simple example, so let's look at a 32-bit version of PCG since it's nice and simple. It has a 32-bit state, and generates 32-bit numbers like this:
static uint32_t state = ...;
static uint32_t generate(void) {
uint32_t s = state;
uint32_t v = ((s >> ((s >> 28) + 4)) ^ s) * (277803737U);
v ^= v >> 22;
state = state * 747796405U + 1729U;
return v;
}
Now think about what happens if two threads call generate()
at roughly the same time. Maybe they both get the same value for state
, and so generate the same random number twice. Maybe one updates state
before the other reads it, so they get different values.
We can eliminate the problem by protecting the generate()
function with a mutex or, in the case of 32-bit PGC (and this is why I use it for reproducible numbers), using atomics. If we do that then we'll always get the same numbers in the same order.
Part two of the problem is what happens when the order of the callers gets mixed up in your code. Let's say you have two threads (called A and B), and they each have to run two iterations of your loop. Even if you're getting your random numbers from a thread-safe source, the order of the calls could be AABB, ABAB, ABBA, BBAA, BABA, or BAAB, each of which would cause your code to generate a different result.
There are several straightforward ways around this. First, you could use synchronization primitives to ensure that each iteration calls generate
in the order you want. The easiest way would probably be to use a queue, but you'll waste a lot of time synchronizing and you'll lose some opportunities for parallelism (and you have to restructure your code significantly).
If you have a relatively small number of iterations, you might consider generating an array ahead of time. Think:
int i;
int nums[LEN];
for (i = 0 ; i < LEN ; i++)
nums[i] = generate();
#pragma omp parallel for ...
for (i = 0 ; i < LEN ; i++) {
do_stuff(nums[i]);
}
A better solution, though, might be to just ditch the idea of generating random numbers altogether and instead use hashing. https://burtleburtle.net/bob/hash/integer.html has some options. An example of that would be something like
for (int i = 0 ; i < LEN ; i++) {
do_stuff(hash(i));
}
Of course you can throw in some salt, maybe even use rand()
to generate your salt.
Here is a block based approach that divides the problem space in N/BLOCK_SIZE blocks and reseeds the RNG with your randInit + block number for each block. This gives a reproducible output no matter the number of threads you have. It also generates the same initial N numbers for sequence of N + x. This as long as you keep the same BLOCK_SIZE.
A good block-size is probably something like your typical N / (max_num_procs * 2). But there is room for experimentation.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#define N_DEFAULT 48 //Default number of nodes
#define BLOCK_SIZE 12 //BLOCK SIZE number of nodes per block.
//Changes this changes reseed frequencey,
//.. and hence the generated sequence
#define randInit 42 //Had to be something.
int main(int argc , char* argv[])
{
int N=N_DEFAULT;
if (argc >1)
N=atoi(argv[1]);
int rands[N];// keep our random numbers for sequential debug output
int n=BLOCK_SIZE;
int num_blocks=(N+BLOCK_SIZE-1)/ BLOCK_SIZE; // ceil(N/BLOCK_SIZE)
int nt=omp_get_max_threads();
printf(" N: %d\n",N);
printf(" Blocks: %d, (size: %d)\n",num_blocks,n);
printf(" threads: %d\n",nt);
//Parallel random generation
#pragma omp parallel for schedule(runtime)
for (int J=0;J<num_blocks;J++)
{
int block_seed=randInit+J; // unique block seed
int start = J * n;
int end= start+n > N?N:start+n;
int tid = omp_get_thread_num(); // Just for debug
printf("thread %d: works on block %d (%d - %d )\n",tid,J,start,end);
for (int j=start; j < end;j++)
{
int t=rand_r(&block_seed); //rand_r provides thread safe (re-entrant rand)
rands[j]=t;
}
}
//Output for debug single thread
for (int j=0; j < N;j++)
{
printf("%d : %d \n",j,rands[j]);
}
return 0;
}
Output with different N, and number of threads shown below.
N: 24 N: 27
Blocks: 3, (size: 8) Blocks: 4, (size: 8)
threads: 4 threads: 1
-------------------------------------|-------------------------------
thread 1: works on block 1 (8 - 16 ) thread 0: works on block 0 (0 - 8 )
thread 2: works on block 2 (16 - 24 ) thread 0: works on block 1 (8 - 16 )
thread 0: works on block 0 (0 - 8 ) thread 0: works on block 2 (16 - 24 )
thread 0: works on block 3 (24 - 27 )
-------------------------------------|-------------------------------
0 : 681191333 0 : 681191333
1 : 928546885 1 : 928546885
2 : 1457394273 2 : 1457394273
3 : 941445650 3 : 941445650
4 : 2129613237 4 : 2129613237
5 : 1661015563 5 : 1661015563
6 : 2071432601 6 : 2071432601
7 : 222443696 7 : 222443696
8 : 1156886562 8 : 1156886562
9 : 398918689 9 : 398918689
10 : 170756699 10 : 170756699
11 : 703115845 11 : 703115845
12 : 1424182583 12 : 1424182583
13 : 1516198481 13 : 1516198481
14 : 1740837599 14 : 1740837599
15 : 1148851528 15 : 1148851528
16 : 1633630368 16 : 1633630368
17 : 2015727614 17 : 2015727614
18 : 1031602773 18 : 1031602773
19 : 463737465 19 : 463737465
20 : 720848057 20 : 720848057
21 : 1369285272 21 : 1369285272
22 : 1411290150 22 : 1411290150
23 : 2074210785 23 : 2074210785
-------------------------------------|-------------------------------
24 : 2109326622
25 : 1486099418
26 : 1892448847