I am trying to modify the imageDenoising sample in the CUDA SDK. I need to run the filter many times in a row so that I can measure its running time, but my code doesn't work properly.
I already answered this for you when you posted the same question previously - you need to wait for a kernel to complete before running it again - add:
cudaThreadSynchronize(); // *** wait for kernel to complete ***
after the kernel call.
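For the timing part of the question, here is a minimal sketch of one way to do it, not the SDK's own code. It queues the repeated launches, waits for the device to finish, and reads the elapsed time from CUDA events. The kernel dummyFilter, the buffer size, and the iteration count are hypothetical placeholders for the real filter. Note that cudaThreadSynchronize() has since been deprecated in favour of cudaDeviceSynchronize(), which is what the sketch calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real filter kernel (an assumption for this sketch).
__global__ void dummyFilter(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 0.5f + 1.0f;
}

int main()
{
    const int n = 1 << 20;        // assumed element count
    const int iterations = 100;   // assumed number of repeats

    float *d_data = NULL;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 threads(256);
    dim3 grid((n + threads.x - 1) / threads.x);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iterations; ++i)
        dummyFilter<<<grid, threads>>>(d_data, n);  // returns to the host immediately
    cudaEventRecord(stop, 0);

    // Wait until every queued launch and the stop event have completed.
    cudaDeviceSynchronize();

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average time per launch: %f ms\n", ms / iterations);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Reading a host timer right after the launches, without the synchronize call, would mostly measure launch overhead rather than the kernel's run time.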
Your kernel is running asynchronously - you need to wait for it to complete, e.g.
cudaMalloc((void **)&dst2, size);
cudaMemcpy(dst2, dst, imageW * imageH * sizeof(TColor), cudaMemcpyHostToDevice);
F1D<<<grid2, threads2>>>(dst, imageW, imageH, dst2);
cudaThreadSynchronize(); // *** wait for kernel to complete ***
cudaFree(dst2);
The statement
image[imageW * iy + ix] = buffer[imageW * iy + ix];
is causing the problem. Your kernel overwrites its own input image, so the result depends on thread execution order: some threads read pixels that other threads have already blurred, and parts of the image end up blurred more than once.
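One common way around this kind of in-place overwrite is to keep two device buffers and swap them between passes, so each launch reads only from one buffer and writes only to the other. The following is a rough sketch under my own assumptions, not your actual filter: blur1D is a hypothetical 3-tap box blur on a 1D float array, and the sizes and pass count are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical 3-tap box blur standing in for the real filter.
// It reads only from src and writes only to dst, so no thread
// ever sees a value produced by the same launch.
__global__ void blur1D(const float *src, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;
    int left  = max(i - 1, 0);       // clamp at the borders
    int right = min(i + 1, n - 1);
    dst[i] = (src[left] + src[i] + src[right]) / 3.0f;
}

int main()
{
    const int n = 1024;      // assumed signal length
    const int passes = 10;   // assumed number of repeated filter passes
    const size_t bytes = n * sizeof(float);

    // Host input: a single impulse in the middle.
    float h_data[n];
    for (int i = 0; i < n; ++i)
        h_data[i] = (i == n / 2) ? 1.0f : 0.0f;

    float *d_a = NULL, *d_b = NULL;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMemcpy(d_a, h_data, bytes, cudaMemcpyHostToDevice);

    float *in = d_a, *out = d_b;
    for (int p = 0; p < passes; ++p) {
        blur1D<<<(n + 255) / 256, 256>>>(in, out, n);
        // Swap roles: this pass's output is the next pass's input.
        float *tmp = in;
        in = out;
        out = tmp;
    }

    // After the final swap, `in` points at the most recent result.
    // This blocking copy also waits for the queued kernels to finish.
    cudaMemcpy(h_data, in, bytes, cudaMemcpyDeviceToHost);
    printf("centre value after %d passes: %f\n", passes, h_data[n / 2]);

    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```

The extra buffer costs memory, but every pass then gives the same result regardless of the order in which threads run.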
Also, I don't see the purpose of
cudaMemcpy(dst2, dst, imageW*imageH*sizeof(TColor),cudaMemcpyHostToDevice);
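dst appears to be device memory already, since the CUDA kernel accesses it directly, so cudaMemcpyHostToDevice does not match that copy. If a copy is needed at all, it would be a device-to-device one.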
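If a second device-side copy of the image really is needed, the copy kind for two device pointers is cudaMemcpyDeviceToDevice. A small sketch with made-up buffer names and sizes (d_src and d_copy are hypothetical, not taken from the question):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int n = 256;                       // assumed element count
    const size_t bytes = n * sizeof(float);

    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i)
        h_in[i] = (float)i;

    float *d_src = NULL, *d_copy = NULL;     // hypothetical device buffers
    cudaMalloc((void **)&d_src, bytes);
    cudaMalloc((void **)&d_copy, bytes);

    // Host to device: the source pointer is host memory.
    cudaMemcpy(d_src, h_in, bytes, cudaMemcpyHostToDevice);

    // Device to device: both pointers are device memory.
    cudaMemcpy(d_copy, d_src, bytes, cudaMemcpyDeviceToDevice);

    // Device to host: read the duplicate back to check it.
    cudaMemcpy(h_out, d_copy, bytes, cudaMemcpyDeviceToHost);

    bool same = true;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i])
            same = false;
    printf("%s\n", same ? "copy matches" : "copy differs");

    cudaFree(d_src);
    cudaFree(d_copy);
    return 0;
}
```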