GPU library that implements Image Convolution using cuFFT?

Question

I've been using the image convolution function from NVIDIA Performance Primitives (NPP). However, my kernel is fairly large relative to the image size, and I've heard rumors that NPP's convolution is a direct convolution rather than an FFT-based one. (I don't think the NPP source code is available, so I'm not sure how it's implemented.)
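To make the kernel-size concern concrete, here is a rough sketch comparing approximate operation counts for direct and FFT-based 2D convolution. The function names are hypothetical, and the 5*N*log2(N) figure for one complex FFT is a textbook estimate used as an assumption. Real GPU timings depend on memory traffic and library tuning, so this is only a back-of-the-envelope guide.

```cpp
#include <cmath>

// Rough multiply-add count for direct 2D convolution:
// every output pixel touches every kernel tap.
double direct_cost(double rows, double cols, double krows, double kcols) {
    return rows * cols * krows * kcols;
}

// Rough count for FFT-based convolution on the padded size
// (rows + krows - 1) x (cols + kcols - 1): two forward FFTs, one inverse FFT,
// and one pointwise complex multiply (about 6 real operations per element).
// ASSUMPTION: one complex FFT of n points costs about 5 * n * log2(n).
double fft_cost(double rows, double cols, double krows, double kcols) {
    const double n = (rows + krows - 1) * (cols + kcols - 1);
    const double one_fft = 5.0 * n * std::log2(n);
    return 3.0 * one_fft + 6.0 * n;
}
```

Under these assumptions, for a 1024x1024 image the estimate favors direct convolution with a 3x3 kernel and the FFT route with a 63x63 kernel, which is why a large kernel makes an FFT-based implementation worth trying.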

I'd like to see how fast a cuFFT-based convolution function could run in the image processing application that I'm working on.

You might say "hey, just put your image into cuFFT and see how fast it is!" And if I were using Matlab, you'd be right--it's a one-line call in Matlab:

% assuming the images are padded
convolved = ifft2(fft2(image) .* fft2(filter));
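The "padded" assumption is what the one-liner hides: for a linear (non-circular) convolution, both arrays must be zero-padded to at least (image rows + filter rows - 1) by (image cols + filter cols - 1) before transforming. The CPU-only sketch below illustrates that with a naive DFT standing in for a real FFT library, and a direct convolution to compare against. All names here are hypothetical and chosen for illustration. Note the explicit 1/(R*C) scaling: MATLAB's ifft2 applies it for you, whereas cuFFT's inverse transform is unnormalized.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Image = std::vector<std::vector<double>>;
using Spectrum = std::vector<std::vector<std::complex<double>>>;

// Copy src into the top-left corner of a rows x cols array of zeros.
Spectrum zero_pad(const Image& src, int rows, int cols) {
    Spectrum out(rows, std::vector<std::complex<double>>(cols));
    for (std::size_t r = 0; r < src.size(); ++r)
        for (std::size_t c = 0; c < src[r].size(); ++c)
            out[r][c] = src[r][c];
    return out;
}

// Naive, unscaled 2D DFT (sign = -1 forward, +1 inverse).
// A real implementation would call an FFT library here instead.
Spectrum dft2(const Spectrum& in, int sign) {
    const int R = static_cast<int>(in.size());
    const int C = static_cast<int>(in[0].size());
    const double pi = std::acos(-1.0);
    Spectrum out(R, std::vector<std::complex<double>>(C));
    for (int u = 0; u < R; ++u)
        for (int v = 0; v < C; ++v) {
            std::complex<double> sum(0.0, 0.0);
            for (int r = 0; r < R; ++r)
                for (int c = 0; c < C; ++c) {
                    const double angle =
                        sign * 2.0 * pi * (double(u) * r / R + double(v) * c / C);
                    sum += in[r][c] * std::polar(1.0, angle);
                }
            out[u][v] = sum;
        }
    return out;
}

// Full linear convolution via the convolution theorem.
Image dft_convolve(const Image& image, const Image& kernel) {
    const int R = static_cast<int>(image.size() + kernel.size()) - 1;
    const int C = static_cast<int>(image[0].size() + kernel[0].size()) - 1;
    Spectrum a = dft2(zero_pad(image, R, C), -1);
    const Spectrum b = dft2(zero_pad(kernel, R, C), -1);
    for (int r = 0; r < R; ++r)
        for (int c = 0; c < C; ++c)
            a[r][c] *= b[r][c];              // pointwise product of spectra
    const Spectrum s = dft2(a, +1);
    Image out(R, std::vector<double>(C));
    for (int r = 0; r < R; ++r)
        for (int c = 0; c < C; ++c)
            out[r][c] = s[r][c].real() / (R * C);  // inverse-transform scaling
    return out;
}

// Direct full convolution, used only as a reference result.
Image direct_convolve(const Image& image, const Image& kernel) {
    const std::size_t R = image.size() + kernel.size() - 1;
    const std::size_t C = image[0].size() + kernel[0].size() - 1;
    Image out(R, std::vector<double>(C, 0.0));
    for (std::size_t i = 0; i < image.size(); ++i)
        for (std::size_t j = 0; j < image[0].size(); ++j)
            for (std::size_t p = 0; p < kernel.size(); ++p)
                for (std::size_t q = 0; q < kernel[0].size(); ++q)
                    out[i + p][j + q] += image[i][j] * kernel[p][q];
    return out;
}
```

Calling dft_convolve and direct_convolve on the same small image and kernel should give matching results up to floating-point rounding, which is a handy correctness check before moving the same steps onto the GPU.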

However, there's a lot of boilerplate needed to get cuFFT to do image convolution. So I'm looking for code that does a cuFFT-based convolution and abstracts away the implementation. And, indeed, I did find a few things:

This GitHub repo has a file called cufft_sample.cu. I thought the code looked promising, but I found another file in the repo containing comments that say the convolution implementation produces incorrect results:

    WARNING: GpuFFTConvOp currently don't return the good answer
    TODO: extend to cover more case, as in many case we will crash!

I had it in my head that the Kitware VTK/ITK codebase provided cuFFT-based image convolution. Alas, it turns out that (at best) cuFFT-based routines are planned for future releases.

I found some code on the Matlab File Exchange that does 2D convolution. The important parts are implemented in C/CUDA, but there's a Matlab wrapper. I'm working on stripping away the Matlab wrapper in favor of pure C/C++/CUDA, but I'm still curious whether there are any solutions that are more elegant and/or proven.

Any recommendations among these three options?

What else is out there in terms of pre-built code that does cuFFT-based image convolution?

Answer 1:

You could try ArrayFire.

In ArrayFire, you can do the following.

array image(rows, columns, h_image);
array filter(frows, fcols, h_filter);
array res = convolve(image, filter);

Depending on the size of the filter, the convolve command uses either cuFFT or a faster hand-tuned kernel. If you prefer to use fft2 directly (with the image and filter padded to the same size), you could do the following:

array res = ifft2(fft2(image) * fft2(filter));

But I highly recommend using convolve instead, because it has been optimized to get the best performance out of cuFFT.