Image resizing using ARM NEON

Submitted by 耗尽温柔 on 2019-12-24 10:59:16

Question


I'm trying to implement a row-by-row version of this image downscaling algorithm: http://intel.ly/1avllXm , applied to 8-bit RGBA images.

To simplify, consider resizing a single row, w_src -> w_dst. Each source pixel either contributes its value to a single output accumulator with weight 1.0, or is split between two consecutive output pixels with weights alpha and (1.0f - alpha). In C/pseudo-code:

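float *acc = calloc(w_dst, sizeof(float));  /* zero-initialized accumulators */
x_dst = 0
for x = 0 .. w_src:
  if x is a pivot column:
     acc[x_dst] += src[x] * alpha;           /* part that falls in the current output pixel */
     x_dst++;
     acc[x_dst] += src[x] * (1.0f - alpha);  /* remainder goes to the next output pixel */
  else
     acc[x_dst] += src[x];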

Finally, divide each accumulator channel by the number of source pixels contributing to it (a float value):

uint8_t *dst = malloc(w_dst);
for x_dst = 0 .. w_dst
  dst[x_dst] = (uint8_t)round(acc[x_dst] / area);

My reference pure C implementation works correctly. However, I'm wondering whether there's a way to speed things up using NEON operations (remember that each pixel is 8-bit RGBA). Thanks!


Answer 1:


Unfortunately, NEON isn't well suited to this kind of job. If the source and destination resolutions were fixed, it would be possible to NEONize the resize with dynamic vectors, but summing a variable number of adjacent pixels doesn't map easily onto SIMD.

I suggest replacing the float arithmetic with fixed-point arithmetic. That alone will help a lot.

Besides, division is very slow, and it hurts performance especially when done inside a loop. Replace it with a multiplication by the reciprocal, like this:
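As a rough illustration of the fixed-point suggestion, here is a minimal sketch of the single-row accumulation from the question, using unsigned q15 weights (1.0 is 32768) in place of floats. The function name `accumulate_row_q15`, the single-channel layout, and the way pivot columns are detected (by comparing each source pixel's right edge against the output pixel boundary) are assumptions made for this example; they are not taken from the question or the linked article.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define Q15_ONE 32768u  /* 1.0 in q15 */

/* Accumulate one single-channel source row into w_dst q15 accumulators.
 * Assumes downscaling only (w_src >= w_dst >= 1) and that one output pixel
 * covers fewer than about 514 source pixels, so the 32-bit sums cannot
 * overflow (255 * 32768 * 514 < 2^32). */
static void accumulate_row_q15(const uint8_t *src, int w_src,
                               uint32_t *acc, int w_dst)
{
    assert(w_dst >= 1 && w_src >= w_dst);

    /* Width of one output pixel measured in source pixels, in q15. */
    uint64_t step = ((uint64_t)w_src << 15) / (uint64_t)w_dst;
    uint64_t boundary = step;  /* right edge of the current output pixel */
    int x_dst = 0;

    memset(acc, 0, (size_t)w_dst * sizeof *acc);

    for (int x = 0; x < w_src; x++) {
        uint64_t left = (uint64_t)x << 15;
        uint64_t right = left + Q15_ONE;

        if (right > boundary && x_dst + 1 < w_dst) {
            /* Pivot column: split the pixel between two outputs. */
            uint32_t alpha = (uint32_t)(boundary - left);
            acc[x_dst] += (uint32_t)src[x] * alpha;
            x_dst++;
            acc[x_dst] += (uint32_t)src[x] * (Q15_ONE - alpha);
            boundary += step;
        } else {
            /* Whole pixel goes to the current output with weight 1.0. */
            acc[x_dst] += (uint32_t)src[x] * Q15_ONE;
            if (right == boundary && x_dst + 1 < w_dst) {
                x_dst++;
                boundary += step;
            }
        }
    }
}
```

For RGBA data the same loop would run once per channel, or the body would be repeated for the four interleaved bytes of each pixel. The accumulators stay in q15 until the final division by the area.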

uint8_t *dst = malloc(w_dst);
float area_ret = 1.0f/area;  /* reciprocal computed once, outside the loop */
for x_dst = 0 .. w_dst
  dst[x_dst] = (uint8_t)round(acc[x_dst] * area_ret);



Answer 2:


On second thought, the vertical downsizing is very well suited to SIMD, because the same arithmetic is applied to all horizontally adjacent pixels in a row.

So here is what I suggest:

  • Resize vertically with NEON using unsigned q15 fixed-point arithmetic. The temporary result is stored in 32 bits per element.
  • Resize horizontally with ARM using unsigned q15 fixed-point arithmetic, then divide by the area, typecast, pack, and store the final result as RGBA.
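The vertical step in the list above can be sketched roughly as follows: every byte of a source row is multiplied by the same q15 row weight and added to a 32-bit accumulator row. The function name `accumulate_row_vertical` and the scalar fallback are assumptions for illustration; the NEON path uses the standard `arm_neon.h` intrinsics and is compiled only when `__ARM_NEON` is defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* acc[i] += row[i] * weight_q15 for n bytes (n = 4 * width for RGBA).
 * weight_q15 is the row's vertical weight in unsigned q15 (32768 = 1.0). */
static void accumulate_row_vertical(uint32_t *acc, const uint8_t *row,
                                    size_t n, uint16_t weight_q15)
{
    size_t i = 0;
    assert(weight_q15 <= 32768u);

#if defined(__ARM_NEON)
    /* Process 8 bytes (two RGBA pixels) per iteration. */
    for (; i + 8 <= n; i += 8) {
        uint16x8_t px = vmovl_u8(vld1_u8(row + i));  /* widen u8 to u16 */
        uint32x4_t lo = vld1q_u32(acc + i);
        uint32x4_t hi = vld1q_u32(acc + i + 4);
        /* Widening multiply-accumulate: u32 += u16 * scalar u16. */
        lo = vmlal_n_u16(lo, vget_low_u16(px), weight_q15);
        hi = vmlal_n_u16(hi, vget_high_u16(px), weight_q15);
        vst1q_u32(acc + i, lo);
        vst1q_u32(acc + i + 4, hi);
    }
#endif
    /* Scalar tail, and the whole row on targets without NEON. */
    for (; i < n; i++)
        acc[i] += (uint32_t)row[i] * weight_q15;
}
```

A caller would zero the accumulator row, call this once per contributing source row (with a weight below 32768 for a row split between two output rows), and then pass the accumulator row to the horizontal stage.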

Please note that the division by the area should be performed as a long multiplication (32 x 32 -> 64 bits) by (1/area) in q17.

Why q17? If you do q15*q17, the result is in q32, held in two 32-bit registers. You don't need any 'typecasting by bit operations' because the upper register already holds the target 8-bit integer value. That's the beauty of fixed-point arithmetic.

Maybe I'll write a fully optimized version of this in the near future, completely in assembly.
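To make the q15*q17 argument concrete, here is a small C sketch of that final step. The helper names `inv_area_q17` and `finalize_q15` are hypothetical, and the rounding term and the clamp to 255 are additions for this example; the answer itself only describes taking the upper 32 bits of the product.

```c
#include <assert.h>
#include <stdint.h>

/* 1/area in unsigned q17. Assumes area >= 1.0 (downscaling), so the
 * result is at most 2^17 and fits easily in 32 bits. */
static uint32_t inv_area_q17(double area)
{
    assert(area >= 1.0);
    return (uint32_t)((double)(1u << 17) / area + 0.5);
}

/* Convert a q15 accumulator to an 8-bit channel value.
 * q15 * q17 = q32, so the integer part is the upper 32 bits of the
 * 64-bit product (the "upper register" of a long multiply). */
static uint8_t finalize_q15(uint32_t acc_q15, uint32_t inv_q17)
{
    uint64_t prod = (uint64_t)acc_q15 * inv_q17;
    /* Adding 0.5 in q32 rounds to nearest instead of truncating. */
    uint32_t hi = (uint32_t)((prod + (UINT64_C(1) << 31)) >> 32);
    /* The rounded reciprocal can overshoot slightly, so clamp. */
    return hi > 255u ? 255u : (uint8_t)hi;
}
```

For instance, two source pixels with values 100 and 101 sum to 201 * 32768 in q15; multiplying by the q17 reciprocal of an area of 2.0 gives 100.5 in q32, which rounds to 101.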



Source: https://stackoverflow.com/questions/17206315/image-resizing-using-arm-neon
