My image processing project works with grayscale images. I have an ARM Cortex-A8 processor platform and I want to make use of NEON.
I have a grayscale image( consider
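There is no instruction that can load your 4 8-bit values directly into 4 32-bit lanes.
You must load them and then widen them twice (vmovl.u8, then vmovl.u16; a plain vshl shifts but does not widen). Because NEON has no 32-bit vector registers (the smallest is a 64-bit D register), you'll have to work on 8 pixels at a time (and not 4).
You may be able to get by with 16-bit elements only. That should be enough...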
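As a rough sketch of the "8 pixels, 16-bit lanes" suggestion above: the helper name widen8_u8_to_u16 is hypothetical, and the scalar branch is only there so the snippet also builds on a non-ARM host.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: widen 8 consecutive 8-bit pixels to 16 bits each. */
void widen8_u8_to_u16(const uint8_t *src, uint16_t dst[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const uint8x8_t  v8  = vld1_u8(src);  /* 8 pixels fill one D register */
    const uint16x8_t v16 = vmovl_u8(v8);  /* zero-extend into one Q register */
    vst1q_u16(dst, v16);
#else
    /* Scalar fallback with the same result, for hosts without NEON. */
    for (int i = 0; i < 8; ++i)
        dst[i] = src[i];
#endif
}
```

If 16 bits per pixel really is enough for the later arithmetic, this avoids the second widening step entirely.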
Load the 4 bytes using a single-lane load instruction (vld1.32 {<register>[<lane>]}, [<address>]) into a d-register (the low half of a q-register), then use two move-long instructions (vmovl) to promote them first to 16 and then to 32 bit. The result should be something like (in GNU syntax):
vld1.32 {d0[0]}, [<address>] @ Now d0 = (*<addr>, *<addr+1>, *<addr+2>, *<addr+3>, <junk>, ... <junk>)
vmovl.u8 q0, d0 @ Now q0 = (d0, d1) = ((uint16_t)*<addr>, ... (uint16_t)*<addr+3>, <junk>, ... <junk>)
vmovl.u16 q0, d0 @ Now q0 = (d0, d1) = ((uint32_t)*<addr>, ... (uint32_t)*<addr+3>)
If you can guarantee that <address> is 4-byte aligned, then write [<address>:32] instead in the load instruction, to save a cycle or two. If you do that and the address isn't aligned, you'll get a fault, however.
Um, I just realized you want to use intrinsics, not assembly, so here's the same thing with intrinsics.
uint32x2_t v8 = vdup_n_u32(0); // Lane 0 will actually hold 4 uint8_t
v8 = vld1_lane_u32((const uint32_t *)ptr, v8, 0); // ptr points at the 4 pixels
const uint16x4_t v16 = vget_low_u16(vmovl_u8(vreinterpret_u8_u32(v8)));
const uint32x4_t v32 = vmovl_u16(v16);
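For completeness, the intrinsics above could be wrapped into a small function along these lines. This is only a sketch: the name widen4_u8_to_u32 is made up, the pointer cast assumes the compiler accepts a byte pointer reinterpreted for vld1_lane_u32, and the scalar branch exists only so the code also builds where NEON is unavailable.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: widen 4 consecutive 8-bit pixels to 4 uint32_t. */
void widen4_u8_to_u32(const uint8_t *src, uint32_t dst[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint32x2_t v8 = vdup_n_u32(0);                     /* defined contents in lane 1 */
    v8 = vld1_lane_u32((const uint32_t *)src, v8, 0);  /* 4 bytes into lane 0 */
    const uint16x4_t v16 = vget_low_u16(vmovl_u8(vreinterpret_u8_u32(v8)));
    const uint32x4_t v32 = vmovl_u16(v16);             /* 4 x 32-bit lanes */
    vst1q_u32(dst, v32);
#else
    /* Scalar fallback with the same result, for hosts without NEON. */
    for (int i = 0; i < 4; ++i)
        dst[i] = src[i];
#endif
}
```

Initialising v8 with vdup_n_u32(0) is a choice made here to avoid reading an uninitialised vector; the unused lanes are discarded by vget_low_u16 either way.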
I recommend that you spend a bit of time understanding how SIMD works on ARM. Take a look at:
to get you started. You can then implement your SIMD code using inline assembler or the corresponding ARM intrinsics, as recommended by domen.
If you need to sum up to 480 8-bit values then you would technically need 17 bits of intermediate storage (480 x 255 = 122,400, which exceeds 65,535). However, if you perform the additions in two stages, i.e., top 240 rows then bottom 240 rows, you can do each in 16 bits (240 x 255 = 61,200). Then you can add the results from the two halves, in a wider type, to get the final answer.
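The bit-width arithmetic can be checked with a few lines of C. The function bits_needed is a hypothetical helper written for this illustration, not part of any library.

```c
#include <stdint.h>

/* Hypothetical helper: number of bits needed to represent `value`. */
unsigned bits_needed(uint32_t value)
{
    unsigned bits = 0;
    while (value != 0) {
        ++bits;
        value >>= 1;
    }
    return bits;
}

/* Worst-case sum of `count` pixels that are all 255. */
uint32_t worst_case_sum(uint32_t count)
{
    return count * 255u;
}
```

Under these definitions, bits_needed(worst_case_sum(480)) is 17 while bits_needed(worst_case_sum(240)) is 16, which is why splitting the column into two halves lets each partial sum live in a 16-bit lane.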
There is actually a NEON instruction that is suitable for your algorithm, called vaddw. It adds a D-register (doubleword) vector to a Q-register (quadword) vector, with the latter containing elements that are twice as wide as the former. In your case, vaddw.u8 can be used to add 8 pixels to 8 16-bit accumulators. Then vaddw.u16 can be used to add the two sets of 8 16-bit accumulators into one set of 8 32-bit ones. Note that each vaddw.u16 handles only 4 elements, so you must use the instruction twice per set (once for the low half and once for the high half), and the 8 32-bit results occupy two Q registers.
If necessary, you can also convert the values back to 16-bit or 8-bit by using vmovn or vqmovn.
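A minimal sketch of the accumulation scheme described above, using intrinsics. The function name sum_columns8, the stride parameter, and the row-major image layout are assumptions for illustration. Rather than hard-coding two halves, it processes chunks of up to 240 rows, which is the same idea generalised. The scalar branch is there only so the code builds without NEON.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: sum 8 adjacent pixel columns over `rows` rows.
   `img` points at the first of the 8 columns in row 0; `stride` is the
   distance in bytes between rows (assumed row-major layout). */
void sum_columns8(const uint8_t *img, size_t stride, size_t rows,
                  uint32_t out[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const size_t chunk = 240;           /* 240 * 255 = 61200 fits in 16 bits */
    uint32x4_t lo = vdupq_n_u32(0);     /* 32-bit sums for columns 0..3 */
    uint32x4_t hi = vdupq_n_u32(0);     /* 32-bit sums for columns 4..7 */
    for (size_t r0 = 0; r0 < rows; r0 += chunk) {
        const size_t r1 = (rows - r0 < chunk) ? rows : r0 + chunk;
        uint16x8_t acc = vdupq_n_u16(0);            /* 8 x 16-bit accumulators */
        for (size_t r = r0; r < r1; ++r)
            acc = vaddw_u8(acc, vld1_u8(img + r * stride));
        lo = vaddw_u16(lo, vget_low_u16(acc));      /* fold low half */
        hi = vaddw_u16(hi, vget_high_u16(acc));     /* fold high half */
    }
    vst1q_u32(out, lo);
    vst1q_u32(out + 4, hi);
#else
    /* Scalar fallback with the same result, for hosts without NEON. */
    for (int c = 0; c < 8; ++c)
        out[c] = 0;
    for (size_t r = 0; r < rows; ++r)
        for (int c = 0; c < 8; ++c)
            out[c] += img[r * stride + c];
#endif
}
```

For an image wider than 8 pixels, one would call this once per group of 8 columns. If the final values need to go back to 16 or 8 bits (for example after dividing by the row count), vmovn or vqmovn would be the narrowing step.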
It depends on your compiler and its (possible lack of) extensions.
E.g., for GCC, this might be a starting point: http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html