How to generate random numbers inside a PyCUDA kernel?

北恋 2021-01-13 16:45

I am using PyCUDA for CUDA programming. I need to generate random numbers inside a kernel function. The CURAND library doesn't work inside it (PyCUDA). Since there is a lot of work to b

2 answers
  •  南笙 (original poster)
    2021-01-13 17:29

    Despite what you assert in your question, PyCUDA has pretty comprehensive support for CURAND. The GPUArray module has a direct interface for filling device memory through the host-side API (note that the random generators still run on the GPU in this case).
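As a minimal sketch of the host-side route, assuming the `XORWOWRandomNumberGenerator` class and its `fill_uniform` method from `pycuda.curandom`: the function names `fill_uniform_on_gpu` and `in_unit_interval` are hypothetical helpers made up for this illustration, and the (0, 1] range check assumes the generator follows CURAND's `curand_uniform` convention.

```python
def in_unit_interval(values):
    """Return True if every sample lies in (0, 1] (assumed curand_uniform range)."""
    return all(0.0 < float(v) <= 1.0 for v in values)


def fill_uniform_on_gpu(n):
    """Fill a float32 GPUArray of length n with uniform samples (needs a CUDA GPU)."""
    # Imported lazily so the helper above stays usable on machines without CUDA.
    import numpy as np
    import pycuda.autoinit  # noqa: F401  (creates the CUDA context)
    from pycuda import gpuarray
    from pycuda.curandom import XORWOWRandomNumberGenerator

    gen = XORWOWRandomNumberGenerator()          # generator states live on the device
    gdata = gpuarray.empty(n, dtype=np.float32)  # uninitialised device memory
    gen.fill_uniform(gdata)                      # called from the host, runs on the GPU
    return gdata
```

On a machine with a CUDA device, `fill_uniform_on_gpu(10240).get()` would copy the samples back to a NumPy array, which can then be passed to `in_unit_interval` as a quick sanity check. No custom kernel is needed on this path.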

    It is also perfectly possible to use the CURAND device-side API from PyCUDA kernel code. In this use case the trickiest part is allocating memory for the per-thread generator states. There are three choices: statically in code, dynamically using host-side memory allocation, and dynamically using device-side memory allocation. The following (very lightly tested) example illustrates the last of these, seeing as you asked about it in your question:

    import numpy as np
    import pycuda.autoinit
    from pycuda.compiler import SourceModule
    from pycuda import gpuarray
    
    code = """
        #include <curand_kernel.h>
    
        const int nstates = %(NGENERATORS)s;
        __device__ curandState_t* states[nstates];
    
        __global__ void initkernel(int seed)
        {
            int tidx = threadIdx.x + blockIdx.x * blockDim.x;
    
            if (tidx < nstates) {
                curandState_t* s = new curandState_t;
                if (s != 0) {
                    curand_init(seed, tidx, 0, s);
                }
    
                states[tidx] = s;
            }
        }
    
        __global__ void randfillkernel(float *values, int N)
        {
            int tidx = threadIdx.x + blockIdx.x * blockDim.x;
    
            // skip threads whose state allocation failed in initkernel
            if (tidx < nstates && states[tidx] != 0) {
                curandState_t s = *states[tidx];
                for(int i=tidx; i < N; i += blockDim.x * gridDim.x) {
                    values[i] = curand_uniform(&s);
                }
                *states[tidx] = s;
            }
        }
    """
    
    N = 1024
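    # arch is hard-coded for one GPU generation; omit it to let PyCUDA target the current device
    mod = SourceModule(code % { "NGENERATORS" : N }, no_extern_c=True, arch="sm_52")
    # no_extern_c=True leaves the kernels with C++-mangled names, so look them up by those
    init_func = mod.get_function("_Z10initkerneli")
    fill_func = mod.get_function("_Z14randfillkernelPfi")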
    
    seed = np.int32(123456789)
    nvalues = 10 * N
    init_func(seed, block=(N,1,1), grid=(1,1,1))
    gdata = gpuarray.zeros(nvalues, dtype=np.float32)
    fill_func(gdata, np.int32(nvalues), block=(N,1,1), grid=(1,1,1))
    

    The initialization kernel needs to be run once to allocate memory for the generator states and initialize them with the seed; the second kernel then uses those states. You will need to be mindful of the device malloc heap size limit if you want to run a lot of threads, but that limit can be changed through the PyCUDA driver API interface.
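A rough sketch of how the heap limit might be checked and raised, assuming PyCUDA's `pycuda.driver.Context.get_limit` / `set_limit` and the `pycuda.driver.limit.MALLOC_HEAP_SIZE` constant. The helper names, the 48-byte state size (an assumption for the default `curandState_t`; check `sizeof` on your own toolkit), and the slack factor of 2 for allocator overhead are all illustrative guesses rather than values from the answer.

```python
DEFAULT_HEAP_BYTES = 8 * 1024 * 1024  # CUDA's documented default device heap size


def required_heap_bytes(n_generators, state_bytes=48, slack=2):
    """Rough estimate of the device heap needed for n_generators states.

    state_bytes=48 is an assumption for the default curandState_t;
    slack=2 is an arbitrary allowance for allocator overhead.
    """
    return n_generators * state_bytes * slack


def ensure_malloc_heap(n_bytes):
    """Raise the device malloc heap limit to at least n_bytes (needs a CUDA GPU)."""
    # Imported lazily so the estimate above works on machines without CUDA.
    import pycuda.autoinit  # noqa: F401  (creates the CUDA context)
    import pycuda.driver as drv

    current = drv.Context.get_limit(drv.limit.MALLOC_HEAP_SIZE)
    if n_bytes > current:
        drv.Context.set_limit(drv.limit.MALLOC_HEAP_SIZE, n_bytes)
    return drv.Context.get_limit(drv.limit.MALLOC_HEAP_SIZE)
```

For the 1024 generators used above, the estimate is far below the default heap, so no change is needed. With around a million generators it would exceed the default, and `ensure_malloc_heap(required_heap_bytes(n))` would have to be called early, before the module that uses device-side `new` is loaded and its kernels launched.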
