CUDA In-place Transpose Error

那年仲夏 submitted on 2019-12-05 18:03:11

Your in-place kernel overwrites data in the image that another thread will later read for its own transpose operation. For a square image, buffer the destination element before overwriting it, then write that buffered value to its proper transposed location. Each thread then effectively performs two copies (a swap), so only about half the threads need to do any work: the ones above the diagonal, where row < col. Something like this should work:

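template<typename T, int blockSize>
__global__ void kernel_transpose_inplace(T* srcDst, int width, int pitch)
{
    // Global column and row handled by this thread.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    // pitch is used here as a row stride in elements, not bytes.
    int tid_in = row * pitch + col;   // element (row, col)
    int tid_out = col * pitch + row;  // its mirror, element (col, row)

    // Only threads strictly above the diagonal swap, so each pair is
    // exchanged exactly once and the diagonal is left untouched.
    if ((row < width) && (col < width) && (row < col)) {
        T temp = srcDst[tid_out];     // buffer the destination first
        srcDst[tid_out] = srcDst[tid_in];
        srcDst[tid_in] = temp;
    }
}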
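
To check the kernel's output, it can help to have a CPU reference that applies the same guard and the same index arithmetic. The following is only a minimal sketch in plain C++, not part of the original answer: the name transpose_inplace_reference is hypothetical, and it assumes pitch is a row stride in elements with pitch >= width, as in the kernel above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical CPU reference (an illustration, not the original author's code).
// It visits the same (row, col) pairs that pass the kernel's guard
// (row < width, col < width, row < col) and performs the same swap.
// Assumption: pitch is the row stride in elements, and pitch >= width.
template <typename T>
void transpose_inplace_reference(std::vector<T>& srcDst, int width, int pitch)
{
    assert(pitch >= width);

    for (int row = 0; row < width; ++row) {
        // Starting at row + 1 is the loop form of the kernel's row < col test:
        // each off-diagonal pair is swapped once, the diagonal never moves.
        for (int col = row + 1; col < width; ++col) {
            int tid_in = row * pitch + col;   // element (row, col)
            int tid_out = col * pitch + row;  // element (col, row)

            T temp = srcDst[tid_out];         // buffer the destination
            srcDst[tid_out] = srcDst[tid_in];
            srcDst[tid_in] = temp;
        }
    }
}
```

Copying the device buffer back to the host and comparing it element by element against this reference, run on a copy of the original input, should expose any remaining race. Padding elements beyond width in each row are never touched by either version.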