The average filter is a windowed filter of the linear class that smooths a signal (image). It works as a low-pass filter. The basic idea behind the filter is that, for each element of the signal (each pixel of the image), it takes the average of the values in its neighborhood window.
Some basic thoughts/steps:
You should be able to scale this pretty easily with 2D memory and multidimensional kernel calls.
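Before moving to the GPU, it may help to pin down exactly what the filter computes. Below is a minimal CPU reference implementation in C++ (a sketch; the function and variable names are my own, and out-of-range neighbors are clamped to the nearest border pixel, matching the texture clamp mode discussed later):

```cpp
#include <vector>
#include <algorithm>

// CPU reference: box (average) filter for an 8-bit, 1-channel image.
// Out-of-range neighbors are clamped to the nearest border pixel.
std::vector<unsigned char> box_filter_cpu(const std::vector<unsigned char>& in,
                                          int width, int height,
                                          int fWidth, int fHeight)
{
    std::vector<unsigned char> out(in.size());
    const int rx = fWidth / 2, ry = fHeight / 2;

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int j = -ry; j <= ry; ++j) {
                for (int i = -rx; i <= rx; ++i) {
                    // Clamp the neighbor coordinates to the image bounds.
                    int nx = std::min(std::max(x + i, 0), width - 1);
                    int ny = std::min(std::max(y + j, 0), height - 1);
                    sum += in[ny * width + nx];
                }
            }
            out[y * width + x] = static_cast<unsigned char>(sum / (fWidth * fHeight));
        }
    }
    return out;
}
```

On a constant image the filter leaves every pixel unchanged; near an isolated bright pixel it spreads the intensity over the window, which is the smoothing effect the GPU version below reproduces.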
This is a classic case of an embarrassingly parallel image-processing problem that can be mapped to the CUDA framework very easily. The averaging filter is known as a box filter in the image-processing domain.
The easiest approach would be to use CUDA textures for the filtering process, because textures handle the boundary conditions very easily. Assuming you have source and destination pointers allocated on the host, the procedure would be something like this.
Kernel:
texture<unsigned char, cudaTextureType2D> tex8u;

//Box filter kernel for a grayscale image with 8-bit depth
__global__ void box_filter_kernel_8u_c1(unsigned char* output, const int width, const int height, const size_t pitch, const int fWidth, const int fHeight)
{
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

    const int filter_offset_x = fWidth / 2;
    const int filter_offset_y = fHeight / 2;

    float output_value = 0.0f;

    //Make sure the current thread is inside the image bounds
    if(xIndex < width && yIndex < height)
    {
        //Sum the window pixels
        for(int i = -filter_offset_x; i <= filter_offset_x; i++)
        {
            for(int j = -filter_offset_y; j <= filter_offset_y; j++)
            {
                //No need to worry about out-of-range access: tex2D handles it automatically.
                output_value += tex2D(tex8u, xIndex + i, yIndex + j);
            }
        }

        //Average the output value
        output_value /= (fWidth * fHeight);

        //Write the averaged value to the output.
        //Transform the 2D index to a 1D index, because the image is actually in linear (pitched) memory.
        const int index = yIndex * pitch + xIndex;
        output[index] = static_cast<unsigned char>(output_value);
    }
}
Wrapper Function:
void box_filter_8u_c1(unsigned char* CPUinput, unsigned char* CPUoutput, const int width, const int height, const int widthStep, const int filterWidth, const int filterHeight)
{
    /*
     * 2D memory is allocated as strided linear memory on the GPU.
     * The terms "pitch", "widthStep", and "stride" mean exactly the same thing:
     * the size of a row in bytes.
     * It is not necessary that width = widthStep.
     * Total bytes occupied by the image = widthStep x height.
     */

    //Declare GPU pointers
    unsigned char *GPU_input, *GPU_output;

    //Allocate 2D memory on the GPU, also known as pitch linear memory
    size_t gpu_image_pitch = 0;
    cudaMallocPitch<unsigned char>(&GPU_input, &gpu_image_pitch, width, height);
    cudaMallocPitch<unsigned char>(&GPU_output, &gpu_image_pitch, width, height);

    //Copy data from host to device
    cudaMemcpy2D(GPU_input, gpu_image_pitch, CPUinput, widthStep, width, height, cudaMemcpyHostToDevice);

    //Bind the image to the texture. Now the kernel will read the input image through the texture cache.
    //Use the tex2D function to read the image.
    cudaBindTexture2D(NULL, tex8u, GPU_input, width, height, gpu_image_pitch);

    /*
     * Set the behavior of tex2D for out-of-range image reads:
     * cudaAddressModeBorder = read zero
     * cudaAddressModeClamp  = read the nearest border pixel
     * We can skip this step; the default mode is Clamp.
     */
    tex8u.addressMode[0] = tex8u.addressMode[1] = cudaAddressModeBorder;

    /*
     * Specify a block size. 256 threads per block are sufficient.
     * It can be increased, but keep in mind the limitations of the GPU:
     * older GPUs allow a maximum of 512 threads per block,
     * current GPUs allow a maximum of 1024 threads per block.
     */
    dim3 block_size(16, 16);

    /*
     * Specify the grid size for the GPU.
     * Make it generalized, so that the grid size changes according to the input image size.
     */
    dim3 grid_size;
    grid_size.x = (width + block_size.x - 1) / block_size.x;   /*< Greater than or equal to image width */
    grid_size.y = (height + block_size.y - 1) / block_size.y;  /*< Greater than or equal to image height */

    //Launch the kernel
    box_filter_kernel_8u_c1<<<grid_size, block_size>>>(GPU_output, width, height, gpu_image_pitch, filterWidth, filterHeight);

    //Copy the results back to the CPU
    cudaMemcpy2D(CPUoutput, widthStep, GPU_output, gpu_image_pitch, width, height, cudaMemcpyDeviceToHost);

    //Release the texture
    cudaUnbindTexture(tex8u);

    //Free GPU memory
    cudaFree(GPU_input);
    cudaFree(GPU_output);
}
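The pitch bookkeeping in the wrapper can be illustrated on the CPU. The sketch below (the struct and its names are my own, purely for illustration) copies a row-major image into a padded "pitched" buffer one row at a time, which is essentially what cudaMemcpy2D does, and shows why a row occupies widthStep bytes even when only width of them are pixels:

```cpp
#include <vector>
#include <cstring>

// Illustrate pitched (strided) storage: each row occupies 'pitch' bytes,
// of which only 'width' bytes are actual pixels.
struct PitchedImage {
    std::vector<unsigned char> data;
    int width, height;
    size_t pitch; // row size in bytes, pitch >= width

    PitchedImage(int w, int h, size_t p) : data(p * h, 0), width(w), height(h), pitch(p) {}

    // Copy a tightly packed (rowSize == width) image into the pitched buffer,
    // one row at a time -- this is what cudaMemcpy2D does for you.
    void copy_from_packed(const unsigned char* src) {
        for (int y = 0; y < height; ++y)
            std::memcpy(&data[y * pitch], src + y * width, width);
    }

    // The 2D -> 1D index transform, exactly as used inside the kernel.
    unsigned char at(int x, int y) const { return data[y * pitch + x]; }
};
```

Note that the total bytes occupied are pitch * height, matching the comment at the top of the wrapper.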
The good news is that you don't have to implement the filter yourself. The CUDA Toolkit ships with a free signal- and image-processing library from NVIDIA called NVIDIA Performance Primitives, aka NPP. NPP utilizes CUDA-enabled GPUs to accelerate processing. The averaging filter is already implemented in NPP. The current version of NPP (5.0) supports 8-bit, 1-channel and 4-channel images. The functions are:
nppiFilterBox_8u_C1R for 1-channel images.
nppiFilterBox_8u_C4R for 4-channel images.

If the filter's size is normal and not humongous, the average filter is a very good case for implementing with CUDA. You can set this up using square blocks, where every thread of the block is responsible for calculating the value of one pixel by summing and averaging its neighbors.
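A call to the 1-channel NPP box filter looks roughly like the sketch below. This is a non-runnable outline: it assumes pitched device buffers like the ones allocated in the wrapper above, and it omits the ROI adjustment needed to avoid reading outside the image at the borders; check the NPP documentation for your toolkit version.

```cpp
#include <npp.h>

// GPU_input / GPU_output: pitched device buffers (Npp8u*), e.g. from
// cudaMallocPitch as above; gpu_image_pitch is the row size in bytes.
NppiSize  roi    = { width, height };                       // region to filter
NppiSize  mask   = { filterWidth, filterHeight };           // filter window size
NppiPoint anchor = { filterWidth / 2, filterHeight / 2 };   // center the window

NppStatus status = nppiFilterBox_8u_C1R(GPU_input,  (Npp32s)gpu_image_pitch,
                                        GPU_output, (Npp32s)gpu_image_pitch,
                                        roi, mask, anchor);
```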
If you store the image in global memory, this is easy to program. One possible optimization is to load tiles of the image into each block's shared memory. Using phantom elements (so that you won't exceed the shared tile's dimensions when looking up neighboring pixels), you can calculate the average of the pixels within a tile.
The only thing you have to be careful of is how the "stitching" is done in the end, because the shared-memory tiles will overlap (due to the extra "padding" pixels) and you don't want to calculate their values twice.
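A shared-memory tile with halo ("phantom") pixels could be sketched like this. This is my own illustration, not code from the answers above: TILE and the filter radius R are assumed compile-time constants, halo loads clamp to the image border, and double counting is avoided because each thread writes only its own output pixel, never a halo pixel.

```cpp
#define TILE 16
#define R 1  // filter radius; the window is (2*R+1) x (2*R+1)

__global__ void box_filter_shared(const unsigned char* in, unsigned char* out,
                                  int width, int height)
{
    // Tile plus a halo of R pixels on every side.
    __shared__ unsigned char tile[TILE + 2*R][TILE + 2*R];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Each thread loads its pixel plus part of the halo, clamping at borders.
    for (int dy = threadIdx.y; dy < TILE + 2*R; dy += TILE) {
        for (int dx = threadIdx.x; dx < TILE + 2*R; dx += TILE) {
            int gx = min(max((int)(blockIdx.x * TILE) + dx - R, 0), width  - 1);
            int gy = min(max((int)(blockIdx.y * TILE) + dy - R, 0), height - 1);
            tile[dy][dx] = in[gy * width + gx];
        }
    }
    __syncthreads();

    if (x < width && y < height) {
        float sum = 0.0f;
        for (int j = -R; j <= R; ++j)
            for (int i = -R; i <= R; ++i)
                sum += tile[threadIdx.y + R + j][threadIdx.x + R + i];
        // Each output pixel belongs to exactly one block's interior,
        // so nothing is computed twice despite the overlapping halos.
        out[y * width + x] = (unsigned char)(sum / ((2*R + 1) * (2*R + 1)));
    }
}
```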