Efficiently compute max of an array of 8 elements in ARM NEON

Question

How do I find the max element in an array of 8 bytes, 8 shorts, or 8 ints? I may need just the position of the max element, just its value, or both.

For example:

unsigned FindMax8(const uint32_t src[8]) // returns position of max element
{
    unsigned ret = 0;
    for (unsigned i=0; i<8; ++i)
    {
        if (src[i] > src[ret])
            ret = i;
    }
    return ret;
}

At -O2, clang unrolls the loop but does not use NEON, which should give a decent performance boost (because it eliminates many data-dependent branches?).

For 8 bytes and 8 shorts the approach should be simpler, as the entire array can be loaded into a single q-register. On arm64 this should be much simpler with the vmaxv intrinsics (e.g. vmaxvq_u16 for 8 shorts), but how do I make it efficient in 32-bit NEON?
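One possible shape for the 8-shorts case on 32-bit NEON is a pairwise-max reduction with vpmax_u16, which halves the number of candidates at each step. This is only a sketch: the function name MaxValue8_u16 is hypothetical, it returns the value only (not the position), and the scalar branch exists solely so the code also builds on targets without NEON.

```cpp
#include <cstdint>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper (not from the question): max value of 8 unsigned shorts.
uint16_t MaxValue8_u16(const uint16_t src[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint16x8_t v = vld1q_u16(src);                    // all 8 lanes in one q-register
    // vpmax takes the max of adjacent pairs: 8 candidates -> 4
    uint16x4_t m = vpmax_u16(vget_low_u16(v), vget_high_u16(v));
    m = vpmax_u16(m, m);                              // 4 -> 2
    m = vpmax_u16(m, m);                              // 2 -> 1, lane 0 holds the max
    return vget_lane_u16(m, 0);
#else
    // Portable fallback for non-NEON builds: plain scalar reduction.
    uint16_t best = src[0];
    for (unsigned i = 1; i < 8; ++i)
        if (src[i] > best)
            best = src[i];
    return best;
#endif
}
```

The three vpmax steps form a dependency chain, so whether this beats scalar code on a given core would have to be measured.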

As noted by Marc in the comments, when the function is changed to return the max value, the GCC auto-vectorizer generates the following for 64-bit NEON:

ldr q0, [x0, 16]
ld1r {v2.4s}, [x0]
ldr q1, [x0]
umax v0.4s, v0.4s, v2.4s
umax v0.4s, v0.4s, v1.4s
umaxv s0, v0.4s
umov w0, v0.s[0]
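The value-returning variant itself is not shown here. A plausible form, assuming it simply mirrors FindMax8 (the name FindMaxValue8 is hypothetical), might look like the sketch below. Because the loop carries only the running maximum and no index, it is a plain max reduction, which matches the umax/umaxv pattern in the listing above.

```cpp
#include <cstdint>

// Hypothetical value-returning counterpart of FindMax8 (assumed, not from the question).
uint32_t FindMaxValue8(const uint32_t src[8])
{
    uint32_t best = src[0];
    for (unsigned i = 1; i < 8; ++i)
    {
        if (src[i] > best)   // only the running maximum is carried between iterations
            best = src[i];
    }
    return best;
}
```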

I have a function that does quite complex math, and at the end of the computation I end up with a uint32x4_t res result; all I need is the index of its max element. This single piece is the slowest part of the code, far slower than the rest of this math-heavy function.

I tried three different approaches (from slowest to fastest according to the profiler):

1. Full computation using NEON, with a final single 32-bit result transfer from NEON to ARM.
2. vst1q_u32(src, res), then regular C code to find the index of the max element.
3. vmov to four 32-bit ARM registers using vget_lane_u64 twice, then some bit shifts to figure out the index of the max element.

Here's the fastest version that I was able to get:

unsigned compute(unsigned short *input)
{
    uint32x4_t result = vld1q_u32((uint32_t*)(input));
    // some computations...
    // ... and at the end I end up with res01 and res23
    // and I need to get index of max element from them:
    uint32x2_t res01 = vget_low_u32(result);
    uint32x2_t res23 = vget_high_u32(result);

    // real code below:
    uint64_t xres01 = vget_lane_u64(vreinterpret_u64_u32(res01), 0);
    uint64_t xres23 = vget_lane_u64(vreinterpret_u64_u32(res23), 0);
    unsigned ret = 0;
    uint32_t xmax0 = (uint32_t)(xres01 & 0xffffffff);
    uint32_t xmax1 = (uint32_t)(xres01 >> 32);
    uint32_t xmax2 = (uint32_t)(xres23 & 0xffffffff);
    uint32_t xmax3 = (uint32_t)(xres23 >> 32);
    if (xmax1 > xmax0)
    {
        xmax0 = xmax1;
        ret = 1;
    }
    if (xmax2 > xmax0)
    {
        xmax0 = xmax2;
        ret = 2;
    }
    if (xmax3 > xmax0)
        ret = 3;
    return ret;
}

The version using full NEON computation does this:

1. Using vmax/vpmax, find the max element.
2. Set a uint32x4_t to the max element in every lane.
3. Using vceq, set the lanes equal to the max to 0xffffffff.
4. Load a uint32x4_t mask with {1u<<31, 1u<<30, 1u<<29, 1u<<28}.
5. Do vand with the mask.
6. Use pairwise add or vorr to collapse all 4 values into a single one.
7. Using vclz, convert that value to the index of the max element.
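The sequence above could be sketched with intrinsics roughly as follows. This is an illustration under assumptions, not the actual code from the question: the names MaxIndex4_neon, MaxIndex4 and kLaneBits are hypothetical, and the scalar branch only emulates the same compare/mask/count-leading-zeros idea so the example builds without NEON.

```cpp
#include <cstdint>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

// Hypothetical helper: index of the max lane of v (lowest index wins on ties).
unsigned MaxIndex4_neon(uint32x4_t v)
{
    // Pairwise max twice leaves the overall maximum in both lanes of m.
    uint32x2_t m = vpmax_u32(vget_low_u32(v), vget_high_u32(v));
    m = vpmax_u32(m, m);
    uint32x4_t vmax = vcombine_u32(m, m);          // broadcast the max to 4 lanes
    uint32x4_t eq = vceqq_u32(v, vmax);            // 0xffffffff where lane == max
    static const uint32_t kLaneBits[4] = {1u << 31, 1u << 30, 1u << 29, 1u << 28};
    uint32x4_t bits = vandq_u32(eq, vld1q_u32(kLaneBits)); // one distinct bit per lane
    // Collapse 4 lanes to 1; the bits are distinct, so add behaves like OR.
    uint32x2_t r = vorr_u32(vget_low_u32(bits), vget_high_u32(bits));
    r = vpadd_u32(r, r);
    return vget_lane_u32(vclz_u32(r), 0);          // leading zeros == lane index
}
#endif

// Portable entry point (hypothetical): NEON path if available, scalar emulation otherwise.
unsigned MaxIndex4(const uint32_t src[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return MaxIndex4_neon(vld1q_u32(src));
#else
    uint32_t mx = src[0];
    for (unsigned i = 1; i < 4; ++i)
        if (src[i] > mx)
            mx = src[i];
    uint32_t bits = 0;
    for (unsigned i = 0; i < 4; ++i)
        if (src[i] == mx)
            bits |= 0x80000000u >> i;              // same per-lane bit as the NEON mask
    unsigned idx = 0;                              // manual count of leading zeros
    while (!(bits & (0x80000000u >> idx)))
        ++idx;
    return idx;
#endif
}
```

Because lane 0 maps to the highest bit, vclz reports the first lane that equals the max, which matches the tie-breaking of the scalar loop with its strict `>` comparison.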

Maybe the issue is somewhere else; see the actual code that I'm trying to optimize, and my optimized version, where only the last piece needs to be improved. Somehow the profiler shows that 80% of the time is spent in the last lines, where I compute the max index. Any ideas? Changing that simple C loop to pairs of registers improves the entire function by 20-30%. Note that, according to the profiler, the two vst1_u32 calls are where the function spends most of its time.

What other approach could I try?

Update: It seems that the slowdown at the end of the function isn't related to the code. I'm not sure why, but when I ran different versions of the function, the timings changed by 3-4x depending on the order in which I called them. Also, further testing suggests that the full NEON version is the fastest if there is no stall at the end of the function, and I'm not sure why that stall happens. For that reason I created a new question to figure out why.

Source: https://stackoverflow.com/questions/49928749/efficiently-compute-max-of-an-array-of-8-elements-in-arm-neon

Tags

c++

arm

intrinsics

neon