CUDA image average filter

情话喂你 2021-01-31 21:31

The average filter is a windowed filter of the linear class that smooths a signal (image). It works as a low-pass filter. The basic idea behind the filter is for any elemen

3 Answers

长发绾君心 (original poster)

2021-01-31 22:15

This is a classic embarrassingly parallel image processing problem that maps very easily onto the CUDA framework. The averaging filter is known as a box filter in image processing.
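To make the "embarrassingly parallel" point concrete, here is a minimal CPU reference sketch in plain C++. The function name `box_filter_cpu`, the tightly packed row layout, the zero value for out-of-image pixels, and the odd window size are all assumptions made for illustration, not part of the original answer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for a box (average) filter on an 8-bit,
// single-channel image stored row by row without padding.
// Assumptions of this sketch: pixels outside the image count as zero,
// and fWidth and fHeight are odd so the window is centred on the pixel.
std::vector<unsigned char> box_filter_cpu(const std::vector<unsigned char>& src,
                                          int width, int height,
                                          int fWidth, int fHeight)
{
    assert(src.size() == static_cast<std::size_t>(width) * height);
    assert(fWidth % 2 == 1 && fHeight % 2 == 1);

    std::vector<unsigned char> dst(src.size());
    const int rx = fWidth / 2;
    const int ry = fHeight / 2;

    // Each (x, y) iteration only reads src and only writes dst[y * width + x],
    // so the iterations are independent and each could run in its own GPU thread.
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int j = -ry; j <= ry; ++j) {
                for (int i = -rx; i <= rx; ++i) {
                    const int sx = x + i;
                    const int sy = y + j;
                    if (sx >= 0 && sx < width && sy >= 0 && sy < height)
                        sum += src[sy * width + sx];
                }
            }
            dst[y * width + x] = static_cast<unsigned char>(sum / (fWidth * fHeight));
        }
    }
    return dst;
}
```

Because no output pixel depends on another output pixel, the two outer loops are the part that a CUDA grid of threads would replace.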

The easiest approach would be to use CUDA textures for the filtering process as the boundary conditions can be handled very easily by textures.
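As a rough illustration of what "handled by textures" means, the sketch below emulates on the CPU the two addressing modes named later in this answer. The helper names `read_border` and `read_clamp` are made up for this example; on the GPU the texture hardware applies the chosen mode without any explicit branching in the kernel.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical CPU stand-ins for two texture addressing modes.

// Behaves like cudaAddressModeBorder: out-of-range reads return zero.
unsigned char read_border(const std::vector<unsigned char>& img,
                          int width, int height, int x, int y)
{
    if (x < 0 || x >= width || y < 0 || y >= height)
        return 0;
    return img[y * width + x];
}

// Behaves like cudaAddressModeClamp: out-of-range reads return the
// nearest pixel on the image edge.
unsigned char read_clamp(const std::vector<unsigned char>& img,
                         int width, int height, int x, int y)
{
    const int cx = std::min(std::max(x, 0), width - 1);
    const int cy = std::min(std::max(y, 0), height - 1);
    return img[cy * width + cx];
}
```

With border mode the averaged values darken toward the image edges because zeros enter the window; with clamp mode the edge pixels are repeated instead.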

Assuming you have source and destination pointers allocated on the host, the procedure would be something like this:

Allocate large enough memory to hold the source and destination images on device.
Copy source image from host to device.
Bind the source image device pointer to texture.
Specify an appropriate block size and a grid large enough to cover every pixel of the image.
Launch the filtering kernel using the specified grid and block size.
Copy the results back to host.
Unbind the texture.
Free device pointers.
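The "grid large enough to cover every pixel" step comes down to a rounded-up integer division per dimension. A minimal sketch, where the helper name `blocks_for` is hypothetical:

```cpp
#include <cassert>

// Hypothetical helper: the smallest number of blocks such that
// blocks * block_dim >= extent (integer ceiling division).
unsigned int blocks_for(unsigned int extent, unsigned int block_dim)
{
    assert(block_dim > 0);
    return (extent + block_dim - 1) / block_dim;
}
```

For a 1920 x 1080 image and 16 x 16 blocks this gives a 120 x 68 grid. Since 68 * 16 = 1088 is more than 1080, the last row of blocks contains threads that fall outside the image, which is why the kernel needs a bounds check before it writes.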

Sample Implementation Of Box Filter

Kernel

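texture<unsigned char, 2, cudaReadModeElementType> tex8u;

//Box Filter Kernel For Gray scale image with 8bit depth
__global__ void box_filter_kernel_8u_c1(unsigned char* output,const int width, const int height, const size_t pitch, const int fWidth, const int fHeight)
{
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

    //Half-sizes of the window. The loops below assume odd fWidth and fHeight.
    const int filter_offset_x = fWidth/2;
    const int filter_offset_y = fHeight/2;

    float output_value = 0.0f;

    //Make sure the current thread is inside the image bounds
    if(xIndex<width && yIndex<height)
    {
        //Sum the window pixels. Out-of-range reads are handled by the texture address mode.
        for(int i = -filter_offset_x; i <= filter_offset_x; i++)
            for(int j = -filter_offset_y; j <= filter_offset_y; j++)
                output_value += tex2D(tex8u, xIndex + i, yIndex + j);

        //Average the sum over the window
        output_value /= (fWidth * fHeight);

        //Write the result. The image is in pitched linear memory, so convert the 2D index to 1D.
        output[yIndex * pitch + xIndex] = static_cast<unsigned char>(output_value);
    }
}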

Wrapper Function:

void box_filter_8u_c1(unsigned char* CPUinput, unsigned char* CPUoutput, const int width, const int height, const int widthStep, const int filterWidth, const int filterHeight)
{

    /*
     * 2D memory is allocated as strided linear memory on GPU.
     * The terminologies "Pitch", "WidthStep", and "Stride" are exactly the same thing.
     * It is the size of a row in bytes.
     * It is not necessary that width = widthStep.
     * Total bytes occupied by the image = widthStep x height.
     */

    //Declare GPU pointer
    unsigned char *GPU_input, *GPU_output;

    //Allocate 2D memory on GPU. Also known as Pitch Linear Memory
    size_t gpu_image_pitch = 0;
    cudaMallocPitch(&GPU_input,&gpu_image_pitch,width,height);
    cudaMallocPitch(&GPU_output,&gpu_image_pitch,width,height);

    //Copy data from host to device.
    cudaMemcpy2D(GPU_input,gpu_image_pitch,CPUinput,widthStep,width,height,cudaMemcpyHostToDevice);

    /*
     * Set the behavior of tex2D for out-of-range image reads before binding.
     * cudaAddressModeBorder = Read Zero
     * cudaAddressModeClamp  = Read the nearest border pixel
     * The default mode is Clamp, so this assignment is only needed to select Border.
     */
    tex8u.addressMode[0] = tex8u.addressMode[1] = cudaAddressModeBorder;

    //Bind the image to the texture. Now the kernel will read the input image through the texture cache.
    //The kernel uses the tex2D function to read the image
    cudaBindTexture2D(NULL,tex8u,GPU_input,width,height,gpu_image_pitch);

    /*
     * Specify a block size. 256 threads per block are sufficient.
     * It can be increased, but keep in mind the limitations of the GPU.
     * Older GPUs allow maximum 512 threads per block.
     * Current GPUs allow maximum 1024 threads per block
     */

    dim3 block_size(16,16);

    /*
     * Specify the grid size for the GPU.
     * Make it generalized, so that the size of grid changes according to the input image size
     */

    dim3 grid_size;
    grid_size.x = (width + block_size.x - 1)/block_size.x;  /*< Greater than or equal to image width */
    grid_size.y = (height + block_size.y - 1)/block_size.y; /*< Greater than or equal to image height */

    //Launch the kernel
    box_filter_kernel_8u_c1<<<grid_size,block_size>>>(GPU_output,width,height,gpu_image_pitch,filterWidth,filterHeight);

    //Copy the results back to CPU
    cudaMemcpy2D(CPUoutput,widthStep,GPU_output,gpu_image_pitch,width,height,cudaMemcpyDeviceToHost);

    //Release the texture
    cudaUnbindTexture(tex8u);

    //Free GPU memory
    cudaFree(GPU_input);
    cudaFree(GPU_output);
}
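The wrapper relies on the host image and the device image having different row sizes (`widthStep` versus the pitch returned by `cudaMallocPitch`). As a host-only sketch of what a pitched 2D copy does, the code below copies row by row between two buffers with different pitches. The names `copy_2d` and `pixel_offset` are hypothetical and this is not how the CUDA runtime is implemented, only an illustration of the addressing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical host-side analogue of a pitched 2D copy: copies `height` rows
// of `widthBytes` bytes each. Source and destination rows may be padded to
// different pitches (row sizes in bytes), so only widthBytes per row is copied.
void copy_2d(unsigned char* dst, std::size_t dstPitch,
             const unsigned char* src, std::size_t srcPitch,
             std::size_t widthBytes, std::size_t height)
{
    assert(widthBytes <= dstPitch && widthBytes <= srcPitch);
    for (std::size_t row = 0; row < height; ++row)
        std::memcpy(dst + row * dstPitch, src + row * srcPitch, widthBytes);
}

// Byte offset of pixel (x, y) in an 8-bit, single-channel pitched image.
std::size_t pixel_offset(std::size_t x, std::size_t y, std::size_t pitch)
{
    return y * pitch + x;
}
```

The same `y * pitch + x` arithmetic is what the kernel uses when it writes its result, which is why the pitch, and not the image width, is passed to it.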

The good news is that you don't have to implement the filter yourself. The CUDA Toolkit ships with a free signal and image processing library from NVIDIA called NVIDIA Performance Primitives (NPP). NPP uses CUDA-enabled GPUs to accelerate processing, and the averaging filter is already implemented in it. The version of NPP current at the time of writing (5.0) supports 8-bit images with 1 or 4 channels. The functions are:

nppiFilterBox_8u_C1R for 1 channel image.
nppiFilterBox_8u_C4R for 4 channel image.
