Fast Gaussian blur on unsigned char image- ARM Neon Intrinsics- iOS Dev

问题

Can someone tell me a fast function to find the gaussian blur of an image using a 5x5 mask. I need it for iOS app dev. I am working directly on the memory of the image defined as

unsigned char *image_sqr_Baseaaddr = (unsigned char *) malloc(noOfPixels);

for (row = 2; row < H-2; row++) 
{
    for (col = 2; col < W-2; col++) 
    {
        newPixel = 0;
        for (rowOffset=-2; rowOffset<=2; rowOffset++)
        {
            for (colOffset=-2; colOffset<=2; colOffset++) 
            {
                rowTotal = row + rowOffset;
                colTotal = col + colOffset;
                iOffset = (unsigned long)(rowTotal*W + colTotal);
                newPixel += (*(imgData + iOffset)) * gaussianMask[2 + rowOffset][2 + colOffset];
            }
        }
        i = (unsigned long)(row*W + col);
        *(imgData + i) = newPixel / 159;
    }
}

This is obviously the slowest function possible. I heard that ARM Neon intrinsics on the iOS can be used to make several operations in 1 cycle. Maybe that's the way to go ?

The problem is that I am not very familiar and don't have enough time to learn assembly language at the moment. So it would be great if anyone can post a Neon intrinsics code for the problem mentioned above or any other fast implementation in C/C++.

回答1:

Before you get into SIMD optimisation with NEON you should first improve your scalar implementation. The biggest problem with your code as it stands is that it has been implemented as if it were a non-separable filter, whereas a Gaussian kernel is separable. By switching to a separable implementation you reduce the number of operations form N^2 to 2N, which in your case of a 5x5 kernel would be a reduction from 25 multiply-adds to 10, i.e. a 2.5x speed up for very little effort.

It may be that a sufficiently optimised scalar implementation will meet your needs without the need to resort to SIMD. If not then you can at least carry these scalar optimisations over into a vectorized implementation.

http://en.wikipedia.org/wiki/Gaussian_blur

http://blogs.mathworks.com/steve/2006/11/28/separable-convolution-part-2/

回答2:

Separate your kernel, as described by Paul R.
Don't re-invent the wheel. Use vImage, which is part of the Accelerate framework, and implements a vectorized, multi-threaded convolution for you. Specifically, it seems like you want the function vImageConvolve_Planar8.

来源：https://stackoverflow.com/questions/9158818/fast-gaussian-blur-on-unsigned-char-image-arm-neon-intrinsics-ios-dev

标签

image-processing

ios5

arm

gaussian

neon