Question
I'm trying to implement a row-by-row version of this image downscaling algorithm: http://intel.ly/1avllXm, applied to 8-bit RGBA images.
To simplify, consider resizing a single row from w_src to w_dst pixels. Each source pixel either contributes its full value to a single output accumulator with weight 1.0, or is split between two consecutive output pixels with weights alpha and (1.0f - alpha). In C/pseudo-code:
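float *acc = calloc(w_dst, sizeof(float));  /* zero-initialized accumulators */
x_dst = 0
for x = 0 .. w_src:
    if x is a pivot column:
        /* src is the source row; a pivot pixel is split between two outputs */
        acc[x_dst] += (src[x] * alpha);
        x_dst++;
        acc[x_dst] += (src[x] * (1.0f - alpha));
    else
        acc[x_dst] += src[x];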
Finally, divide each accumulator channel by the number of source pixels contributing to it (area, a float value):
uint8_t *dst = malloc(w_dst);
for x_dst = 0 .. w_dst
    dst[x_dst] = (uint8_t)round(acc[x_dst] / area);
My reference pure C implementation works correctly. However, I'm wondering whether there's a way to speed things up using NEON operations (remember that each pixel is 8-bit RGBA). Thanks!
Answer 1:
Unfortunately, NEON isn't very well suited to this kind of job. If the source and destination resolutions were fixed, it would be possible to NEONize the resize with dynamic vectors, but summing a variable number of adjacent pixels doesn't map easily onto SIMD.
I suggest replacing the float arithmetic with fixed-point arithmetic. That alone will help a lot.
Besides, division is very slow, and it hurts performance most when done inside a loop. You should replace it with a multiplication by the reciprocal, computed once outside the loop:
uint8_t *dst = malloc(w_dst);
float area_ret = 1.0f / area;  /* reciprocal of area, computed once */
for x_dst = 0 .. w_dst
    dst[x_dst] = (uint8_t)round(acc[x_dst] * area_ret);
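The fixed-point suggestion could be sketched roughly as follows for a single channel. This is only an illustration under assumptions that are not in the question: the function name accumulate_row_q15 is made up, w_dst <= w_src, weights are stored in Q15 (1.0 == 32768), and a "pivot column" is taken to be any source pixel that straddles the boundary between two output pixels. The linked article may define pivots differently.

```c
#include <stdint.h>
#include <string.h>

#define Q15_ONE 32768u  /* 1.0 in Q15 */

/* Hypothetical helper: accumulate one single-channel row of w_src pixels
 * into w_dst Q15 sums. Assumes w_dst <= w_src and acc has w_dst elements.
 * A uint32_t accumulator holds roughly 500 full-weight 8-bit pixels. */
static void accumulate_row_q15(const uint8_t *src, int w_src,
                               uint32_t *acc, int w_dst)
{
    memset(acc, 0, (size_t)w_dst * sizeof *acc);
    int x_dst = 0;
    for (int x = 0; x < w_src; x++) {
        /* Positions are measured in units of 1/w_dst of a source pixel:
         * source pixel x spans [x*w_dst, (x+1)*w_dst) and output pixel
         * x_dst ends at (x_dst+1)*w_src. */
        int64_t px_end  = (int64_t)(x + 1) * w_dst;
        int64_t dst_end = (int64_t)(x_dst + 1) * w_src;
        if (px_end > dst_end) {
            /* Pivot column: split the pixel between two accumulators. */
            uint32_t left  = (uint32_t)(dst_end - (int64_t)x * w_dst);
            uint32_t alpha = (left * Q15_ONE) / (uint32_t)w_dst;
            acc[x_dst] += src[x] * alpha;
            x_dst++;
            acc[x_dst] += src[x] * (Q15_ONE - alpha);
        } else {
            acc[x_dst] += (uint32_t)src[x] << 15;   /* weight 1.0 */
            if (px_end == dst_end)
                x_dst++;            /* pixel ends exactly on the boundary */
        }
    }
}
```

Because alpha and (Q15_ONE - alpha) always sum to exactly 1.0 in Q15, the total weight of every source pixel is preserved, so rounding alpha does not change the overall brightness of the row.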
Answer 2:
On second thought, the vertical downsizing is very well suited to SIMD, because the same arithmetic is applied to every horizontally adjacent pixel in a row.
So here is what I suggest:
- Resize vertically with NEON using unsigned q15 fixed-point arithmetic. Store the intermediate result as 32 bits per element.
- Resize horizontally with plain ARM instructions using unsigned q15 fixed-point arithmetic, then divide by the area, convert, pack, and store the final result as RGBA.
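As a rough sketch of the vertical step, the inner loop could accumulate one source row into a row of 32-bit sums with a single Q15 weight. The function name accumulate_row_vertical and its signature are hypothetical, and the code uses NEON intrinsics rather than the hand-written assembly mentioned in this answer; it falls back to plain C when NEON is not available.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: acc[i] += row[i] * weight_q15 for n bytes
 * (n = 4 * width for RGBA). weight_q15 is a Q15 weight in [0, 32768]. */
static void accumulate_row_vertical(uint32_t *acc, const uint8_t *row,
                                    size_t n, uint16_t weight_q15)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 8 <= n; i += 8) {
        /* Load 8 bytes (two RGBA pixels) and widen them to 16 bits. */
        uint16x8_t px = vmovl_u8(vld1_u8(row + i));
        uint32x4_t lo = vld1q_u32(acc + i);
        uint32x4_t hi = vld1q_u32(acc + i + 4);
        /* Widening multiply-accumulate: acc += px * weight. */
        lo = vmlal_n_u16(lo, vget_low_u16(px), weight_q15);
        hi = vmlal_n_u16(hi, vget_high_u16(px), weight_q15);
        vst1q_u32(acc + i, lo);
        vst1q_u32(acc + i + 4, hi);
    }
#endif
    /* Scalar tail, and the whole loop when NEON is unavailable. */
    for (; i < n; i++)
        acc[i] += (uint32_t)row[i] * weight_q15;
}
```

A source row that lies entirely inside one output row would be added with weight 32768; a pivot row would be added twice, once to each neighbouring output row, with weights alpha and 32768 - alpha.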
Please note that the division by area should be performed as a long multiplication (32 bits x 32 bits, giving a 64-bit result) by (1/area) in q17.
Why q17? If you multiply q15 by q17, the result is in q32, held in two 32-bit registers. You don't need any 'typecasting by bit operations', because the upper register already contains the target 8-bit integer value. That's the beauty of fixed-point arithmetic.
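In C, the Q15 x Q17 normalization might look like the sketch below. The helper names inv_area_q17 and normalize_q15 are made up for illustration, area is assumed to equal w_src / w_dst, and the rounding constant and clamp are additions that the answer does not mention.

```c
#include <stdint.h>

/* Hypothetical helper: 1/area in Q17, rounded to nearest, where
 * area = w_src / w_dst is assumed (so 1/area = w_dst / w_src <= 1). */
static uint32_t inv_area_q17(uint32_t w_src, uint32_t w_dst)
{
    return (uint32_t)((((uint64_t)w_dst << 17) + w_src / 2) / w_src);
}

/* Hypothetical helper: turn a Q15 accumulator into an 8-bit average. */
static uint8_t normalize_q15(uint32_t acc_q15, uint32_t inv_q17)
{
    /* Q15 * Q17 = Q32: the integer part is the upper 32 bits. */
    uint64_t prod = (uint64_t)acc_q15 * inv_q17;
    /* Add 0.5 in Q32 to round to nearest, then keep the upper word. */
    uint32_t hi = (uint32_t)((prod + (UINT64_C(1) << 31)) >> 32);
    return (uint8_t)(hi > 255u ? 255u : hi);   /* defensive clamp */
}
```

For example, three source pixels of value 100 reduced to two outputs give a Q15 sum of 100 * 1.5 * 32768 per output, and multiplying by 1/1.5 in Q17 leaves 100 in the upper word.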
Maybe I'll write a fully optimized version of this in the near future, completely in assembly.
Source: https://stackoverflow.com/questions/17206315/image-resizing-using-arm-neon