CUDA In-place Transpose Error
I'm implementing a CUDA program for transposing an image. I created 2 kernels. The first kernel does out of place transposition and works perfectly for any image size. Then I created a kernel for in-place transposition of square images. However, the output is incorrect. The lower triangle of the image is transposed but the upper triangle remains the same. The resulting image has a stairs like pattern in the diagonal and the size of each step of the stairs is equal to the 2D block size which I used for my kernel. Out-of-Place Kernel: Works perfectly for any image size if src and dst are