memset in CUDA that can set values from within a kernel

Question

I am making several cudaMemset calls to set my values to 0, as shown below:

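// N (the number of buffers) is assumed to be defined elsewhere.
void allocateByte(char **gStoreR, const int byte) {
    // Host-side staging array that collects the device pointers.
    char **cStoreR = (char **)malloc(N * sizeof(char *));

    for (int i = 0; i < N; i++) {
        char *c;
        cudaMalloc((void **)&c, byte * sizeof(char));
        cudaMemset(c, 0, byte);  // one cudaMemset call per buffer
        cStoreR[i] = c;
    }
    // Copy the table of device pointers to the device.
    cudaMemcpy(gStoreR, cStoreR, N * sizeof(char *), cudaMemcpyHostToDevice);
    free(cStoreR);  // the staging array is no longer needed
}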

However, this is proving to be very slow. Is there a memset function on the GPU, since calling it from the CPU takes a lot of time? Also, does cudaMalloc((void**)&c, byte*sizeof(char)) automatically set the bits that c points to to 0?

Answer 1:

Every cudaMemset call launches a kernel, so if N is large and byte is small, kernel launch overhead will dominate and slow the code down. cudaMemset itself is a host-side API call, so the solution is to write a kernel that traverses the allocations and zeros the storage in a single launch.
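A minimal sketch of such a kernel, under stated assumptions: the names zeroBuffers and zeroAllBuffers are hypothetical, every buffer is assumed to have the same size, and the pointer table is assumed to already live in device memory (as gStoreR does in the question). The block size of 256 and the cap of 1024 blocks are arbitrary illustrative choices, not tuned values.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: zero numBuffers device buffers of bytesPerBuffer bytes
// each, given a device-resident table of pointers to them.
__global__ void zeroBuffers(char **buffers, int numBuffers, int bytesPerBuffer)
{
    // Treat all buffers as one logical range of numBuffers * bytesPerBuffer bytes.
    const size_t total  = (size_t)numBuffers * bytesPerBuffer;
    const size_t stride = (size_t)gridDim.x * blockDim.x;

    // Grid-stride loop: each thread clears every stride-th byte of that range.
    for (size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         idx < total; idx += stride) {
        buffers[idx / bytesPerBuffer][idx % bytesPerBuffer] = 0;
    }
}

// Hypothetical host helper: clears every buffer with a single kernel launch.
cudaError_t zeroAllBuffers(char **dBuffers, int numBuffers, int bytesPerBuffer)
{
    if (numBuffers <= 0 || bytesPerBuffer <= 0) return cudaSuccess;

    const int threads  = 256;  // assumed block size
    const size_t total = (size_t)numBuffers * bytesPerBuffer;
    size_t blocks = (total + threads - 1) / threads;
    if (blocks > 1024) blocks = 1024;  // assumed cap; the loop covers the rest

    zeroBuffers<<<(unsigned)blocks, threads>>>(dBuffers, numBuffers, bytesPerBuffer);
    return cudaGetLastError();  // reports launch errors only
}
```

With something like this, the per-buffer cudaMemset calls in the loop could be dropped and replaced by one call such as zeroAllBuffers(gStoreR, N, byte) after the pointer table has been copied to the device.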

As an aside, I would strongly recommend against using this kind of structure of arrays (an array of pointers to separately allocated buffers) in CUDA. It is a lot slower and much more complex to manage than achieving the same outcome with a single large block of linear memory and indexing into that memory. In your example, it would reduce the code to a single cudaMalloc call and a single cudaMemset call. On the device side, pointer indirection, which is slow, gets replaced by a few integer operations, which are very fast. If your source material on the host is an array of structures, I would recommend using something like the excellent thrust::zip_iterator to get the data into a GPU-friendly form on the device.
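The single-block layout could look roughly like the following sketch. The names allocateFlat, bufferAt, and tagBuffers are hypothetical, and the sketch assumes all logical buffers have the same size; the kernel is only there to show that a buffer's address is computed with integer arithmetic instead of being loaded from a pointer table.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: allocate numBuffers logical buffers of bytesPerBuffer
// bytes each as ONE linear block, then zero it with ONE cudaMemset call.
cudaError_t allocateFlat(char **dStore, int numBuffers, int bytesPerBuffer)
{
    const size_t total = (size_t)numBuffers * bytesPerBuffer;
    cudaError_t err = cudaMalloc((void **)dStore, total);
    if (err != cudaSuccess) return err;
    return cudaMemset(*dStore, 0, total);
}

// Start of logical buffer i inside the flat block: integer arithmetic only.
__host__ __device__ inline char *bufferAt(char *store, int i, int bytesPerBuffer)
{
    return store + (size_t)i * bytesPerBuffer;
}

// Hypothetical kernel: thread i writes a tag into the first byte of buffer i.
__global__ void tagBuffers(char *store, int numBuffers, int bytesPerBuffer)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numBuffers) {
        bufferAt(store, i, bytesPerBuffer)[0] = (char)(i + 1);
    }
}
```

Compared with the question's allocateByte, this needs no host-side staging array and no copy of a pointer table, and the whole store is released with a single cudaFree.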

Source: https://stackoverflow.com/questions/7846832/memset-in-cuda-that-allows-to-set-values-within-kernel

Tags

cuda

parallel-processing

nvidia