Why is cuFFT so slow?

前端未结

关注

 1  1913

I\'m hoping to accelerate a computer vision application that computes many FFTs using FFTW and OpenMP on an Intel CPU. However, for a variety of FFT problem sizes, I\'ve found t

相关标签:

1条回答

臣服心动

2021-02-06 13:17
Question might be outdated, though here is a possible explanation (for the slowness of cuFFT).

When structuring your data for cufftPlanMany, the data arrangement is not very nice with the GPU. Indeed, using an istride and ostride of 32 means no data read is coalesced. See here for details on the read pattern
```
input[b * idist + (x * inembed[1] + y) * istride]
output[b * odist + (x * onembed[1] + y) * ostride]
```
in which case if i/ostride is 32, it will very unlikely be coalesced/optimal. (indeed b is the batch number). Here are the changes I applied:
```
    CHECK_CUFFT(cufftPlanMany(&forwardPlan,
              2, //rank
              n, //dimensions = {nRows, nCols}
              inembed, //inembed
              1,  // WAS: depth, //istride
              nRows*cols_padded, // WAS: 1, //idist
              onembed, //onembed
              1, // WAS: depth, //ostride
              nRows*cols_padded, // WAS:1, //odist
              CUFFT_R2C, //cufftType
              depth /*batch*/));
```
Running this, I entered a unspecified launch failure because of illegal memory access. You might want to change the memory allocation (cufftComplex is two floats, you need an x2 in your allocation size - looks like a typo).
```
// WAS : CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth)); 
CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth*2)); 
```
When running it this way, I got a x8 performance improvement on my card.
0 讨论(0)
发布评论:

提交评论
- 加载中...