Why is cuFFT so slow?

Asked 2021-02-06 13:06 by 耶瑟儿~

I'm hoping to accelerate a computer vision application that computes many FFTs using FFTW and OpenMP on an Intel CPU. However, for a variety of FFT problem sizes, I've found t

1 Answer
  •  臣服心动
    2021-02-06 13:17

    The question might be outdated, but here is a possible explanation for the slowness of cuFFT.

    When structuring your data for cufftPlanMany, the arrangement is not GPU-friendly: with an istride and ostride of 32, no data read is coalesced. The read pattern is

    input[b * idist + (x * inembed[1] + y) * istride]
    output[b * odist + (x * onembed[1] + y) * ostride]
    

    where b is the batch number. If istride/ostride is 32, these accesses are very unlikely to be coalesced or optimal. Here are the changes I applied:

        CHECK_CUFFT(cufftPlanMany(&forwardPlan,
                  2, //rank
                  n, //dimensions = {nRows, nCols}
                  inembed, //inembed
                  1,  // WAS: depth, //istride
                  nRows*cols_padded, // WAS: 1, //idist
                  onembed, //onembed
                  1, // WAS: depth, //ostride
                  nRows*cols_padded, // WAS:1, //odist
                  CUFFT_R2C, //cufftType
                  depth /*batch*/));
    

    Running this, I hit an unspecified launch failure caused by an illegal memory access. You might want to change the memory allocation: cufftComplex is two floats, so you need a factor of 2 in your allocation size (looks like a typo).

    // WAS : CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth)); 
    CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth*2)); 
    

    When running it this way, I got an 8x performance improvement on my card.
