2D Image Indexing Bug in CUDA Kernel

前端未结


I'm doing linear filtering on images using CUDA. I use 2D thread blocks and a 2D grid to make the problem natural. Here's how I index: (height and width


1 Answer

轮回少年

2021-01-21 00:43

There are quite a few things you still haven't described very well, but based on the information you have posted, I built what I am guessing is a reasonable repro case with parameters which match a case you say is failing (450 x 364 with filterSize=5):

#include <stdio.h>
#include <assert.h>

template<int filterSize>
__global__ void filter_8u_c1_kernel(unsigned char* in, unsigned char* out, int width, int height, float* filter, int fSize)
{
    unsigned int xIndex = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int yIndex = blockIdx.y*blockDim.y + threadIdx.y;
    unsigned int tid = yIndex * width + xIndex;

    unsigned int N = filterSize/2;

    if(yIndex>=height-N || xIndex>=width-N || yIndex<N || xIndex<N)
        return;

    out[tid] = in[tid];
}

int main(void)
{
    const int width = 450, height = 365, filterSize=5;
    const size_t isize = sizeof(unsigned char) * size_t(width * height);
    unsigned char * _in, * _out, * out;

    assert( cudaMalloc((void **)&_in, isize) == cudaSuccess );
    assert( cudaMalloc((void **)&_out, isize) == cudaSuccess );
    assert( cudaMemset(_in, 'Z', isize) == cudaSuccess );
    assert( cudaMemset(_out, 'A', isize) == cudaSuccess );

    const dim3 BlockDim(16,16);
    dim3 GridDim;
    GridDim.x = (width + BlockDim.x - 1) / BlockDim.x;
    GridDim.y = (height + BlockDim.y - 1) / BlockDim.y;

    filter_8u_c1_kernel<filterSize><<<GridDim,BlockDim>>>(_in,_out,width,height,0,0);
    assert( cudaPeekAtLastError() == cudaSuccess );

    out = (unsigned char *)malloc(isize);
    assert( cudaMemcpy(out, _out, isize, cudaMemcpyDeviceToHost) == cudaSuccess);

    for(int i=0; i<width; i++) {
        fprintf(stdout, "%d: ", i);
        for(int j=0; j<height; j++) {
            unsigned int idx = i + j*width;
            fprintf(stdout, "%c", out[idx]);
        }
        fprintf(stdout, "\n");
    }

    return cudaThreadExit();
}

When run, it does exactly what I would expect: it overwrites the output memory with the input everywhere except for the first and last two lines, and the first and last two entries in all the lines in between. This was run with CUDA 3.2 on OS X 10.6.5 with a compute capability 1.2 GPU. So whatever is happening in your code, it isn't happening in my repro case, which means either that I have misinterpreted what you have written, or that there is something else you haven't described which is causing the problem.
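To check the expected pattern without a GPU, the launch can be emulated on the host. The following is only a sketch, not part of the original repro: `ceil_div` and `emulate_copy_kernel` are hypothetical helper names, and the loops simply visit every thread index that a grid of square `blockSize` x `blockSize` blocks would launch, applying the same border guard as `filter_8u_c1_kernel`.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Ceiling division, the same arithmetic used for GridDim.x and GridDim.y.
inline unsigned int ceil_div(unsigned int n, unsigned int block) {
    return (n + block - 1) / block;
}

// Hypothetical host-side stand-in for the kernel launch: visits every
// (xIndex, yIndex) pair the grid would cover and applies the border guard.
std::vector<unsigned char> emulate_copy_kernel(int width, int height,
                                               int filterSize,
                                               unsigned int blockSize) {
    const std::size_t count = static_cast<std::size_t>(width) * height;
    std::vector<unsigned char> in(count, 'Z');   // mirrors cudaMemset(_in, 'Z', ...)
    std::vector<unsigned char> out(count, 'A');  // mirrors cudaMemset(_out, 'A', ...)

    // The kernel compares unsigned indices against int dimensions, which
    // converts the dimensions to unsigned implicitly; here it is explicit.
    const unsigned int w = static_cast<unsigned int>(width);
    const unsigned int h = static_cast<unsigned int>(height);
    const unsigned int N = static_cast<unsigned int>(filterSize) / 2;

    // Total threads launched per axis (can exceed the image size).
    const unsigned int threadsX = ceil_div(w, blockSize) * blockSize;
    const unsigned int threadsY = ceil_div(h, blockSize) * blockSize;

    for (unsigned int yIndex = 0; yIndex < threadsY; ++yIndex) {
        for (unsigned int xIndex = 0; xIndex < threadsX; ++xIndex) {
            // Same guard as the kernel: skip the N-pixel border and any
            // thread that falls outside the image.
            if (yIndex >= h - N || xIndex >= w - N || yIndex < N || xIndex < N)
                continue;
            const std::size_t tid = static_cast<std::size_t>(yIndex) * w + xIndex;
            out[tid] = in[tid];
        }
    }
    return out;
}

// Number of pixels the emulated kernel overwrote with the input value.
inline std::size_t count_copied(const std::vector<unsigned char>& out) {
    return static_cast<std::size_t>(std::count(out.begin(), out.end(), 'Z'));
}
```

For the parameters in the repro code (width 450, height 365, filterSize 5, 16 x 16 blocks), `ceil_div` gives a 29 x 23 grid, and `count_copied(emulate_copy_kernel(450, 365, 5, 16))` should equal (450 - 4) * (365 - 4) = 161006, i.e. every pixel except a two-pixel border. Note that the guard relies on `height` and `width` being at least `N`; with smaller dimensions the unsigned subtraction would wrap around.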
