AVX/SSE version of xorshift128+

前端 未结 2 1360
醉酒成梦
醉酒成梦 2020-12-06 06:24

I am trying to make the fastest possible high quality RNG. Having read http://xorshift.di.unimi.it/ , xorshift128+ seems like a good option. The C code is

#         


        
相关标签:
2条回答
  • 2020-12-06 06:49

    XorShift is indeed a good choice. It is so good, so fast and requires so little state that I'm surprised to see so little adoption. It should be the standard generator on all platforms. I have implemented it myself 8 years ago and even then it could generate 800MB/s of random bytes.

    You cannot use vector instructions to speed up generating a single random number. There is too little instruction-level parallelism in those few instructions.

    But you can easily speed up generating N numbers where N is the vector size of your target instruction set. Just run N generators in parallel. Keep state for N generators and generate N numbers at the same time.

    If client code demands numbers one at a time you could keep a buffer of N (or more) numbers. If the buffer is empty you fill it using vector instructions. If the buffer is not empty you just return the next number.

    0 讨论(0)
  • 2020-12-06 07:03

    For anyone else who might reach this question, I think this C++ code implements correctly 4 xorshift128plus generators running in parallel, using AVX2:

    __m256i xorshift128plus_avx2(__m256i &state0, __m256i &state1)
    {
        __m256i s1 = state0;
        const __m256i s0 = state1;
        state0 = s0;
        s1 = _mm256_xor_si256(s1, _mm256_slli_epi64(s1, 23));
        state1 = _mm256_xor_si256(_mm256_xor_si256(_mm256_xor_si256(s1, s0),
                                                   _mm256_srli_epi64(s1, 18)),
                                  _mm256_srli_epi64(s0, 5));
        return _mm256_add_epi64(state1, s0);
    }
    

    The scalar implementation I used is:

    u64 xorshift128plus(u64 &state0, u64 &state1)
    {
        u64 s1 = state0;
        const u64 s0 = state1;
        state0 = s0;
        s1 ^= s1 << 23;                              // a
        state1 = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5); // b, c
        return state1 + s0;
    }
    

    Which is the same one in the xorshiftplus paper. Note that the right-shift constants from the original question do not correspond to the ones in the paper.

    0 讨论(0)
提交回复
热议问题