Optimizing horizontal boolean reduction in ARM NEON

时光总嘲笑我的痴心妄想 提交于 2021-02-18 10:59:09

问题


I'm experimenting with a cross-platform SIMD library ala ecmascript_simd aka SIMD.js, and part of this is providing a few "horizontal" SIMD operations. In particular, the API that library offers includes any(<boolN x M>) -> bool and all(<boolN x M>) -> bool functions, where <T x K> is a vector of K elements of type T and boolN is an N-bit boolean, i.e. all ones or all zeros, as SSE and NEON return for their comparison operations.

For example, let v be a <bool32 x 4> (a 128-bit vector), it could be the result of VCLT.S32 or something. I'd like to compute all(v) = v[0] && v[1] && v[2] && v[3] and any(v) = v[0] || v[1] || v[2] || v[3].

This is easy with SSE, e.g. movmskps will extract the high bit of each element, so all for the type above becomes (with C intrinsics):

#include<xmmintrin.h>
int all(__m128 x) {
    return _mm_movemask_ps(x) == 8 + 4 + 2 + 1;
}

and similarly for any.

I'm struggling to find obvious/nice/efficient ways to implement this with NEON, which doesn't support an instruction like movmskps. There's the approach of simply extracting each element and computing with scalars. E.g. there's the naive method but there's also the approach of using the "horizontal" operations NEON supports, like VPMAX and VPMIN.

#include<arm_neon.h>

int all_naive(uint32x4_t v) {
    return v[0] && v[1] && v[2] && v[3];
}
int all_horiz(uint32x4_t v) {
    uint32x2_t x = vpmin_u32(vget_low_u32(v),
                             vget_high_u32(v));
    uint32x2_t y = vpmin_u32(x, x);
    return x[0] != 0;
}

(One can do a similar thing for the latter with VPADD, which may be faster, but it's fundamentally the same idea.)

Are there are other tricks one can use to implement this?


Yes, I know that horizontal operations are not great with SIMD vector units. But sometimes it is useful, e.g. many SIMD implementations of mandlebrot will operate on 4 points at once, and bail out of the inner loop when all of them are out of range... which requires doing a comparison and then a horizontal and.


回答1:


NOTE: first time looking at arm today, I might be wrong about things.

UPD: Removed ARM-V7 and will write up what we ended up doing in a separate answer

ARM-V8.

For ARM-V8, have a look at this strlen implementation from glibc: https://code.woboq.org/userspace/glibc/sysdeps/aarch64/multiarch/strlen_asimd.S.html

ARM-V8 introduced reductions across registers. Here they use min to compare with 0

        uminv        datab2, datav.16b
        mov          tmp1, datav2.d[0]
        cbnz         tmp1, L(main_loop)

Find the smallest char, compare with 0 - take the next 16 bytes.

There are a few other reductions in ARM-V8 like vaddvq_u8.
I'm pretty sure you can do most of the things you'd want from movemask and alike with this.

Another interesting thing here is how they find the first_true

        /* Set te NULL byte as 0xff and the rest as 0x00, move the data into a
           pair of scalars and then compute the length from the earliest NULL
           byte.  */
        cmeq        datav.16b, datav.16b, #0
        mov        data1, datav.d[0]
        mov        data2, datav.d[1]
        cmp        data1, 0
        csel        data1, data1, data2, ne
        sub        len, src, srcin
        rev        data1, data1
        add        tmp2, len, 8
        clz        tmp1, data1
        csel        len, len, tmp2, ne
        add        len, len, tmp1, lsr 3

Looks a bit intimidating, but my understanding is:

  1. they narrow it down to a 64 bit number just by doing if/else (if the first half doesn't have the zero - the second half does.
  2. use count leading zeroes to find the position (didn't quite understand all of the endianness stuff here but it's libc - so this is the correct one).

So - if you only need V8 - there is a solution.




回答2:


This is my current solution that is implemented in eve library.

If your backend has C++20 support, you can just use the library: it has implementations for arm-v7, arm-v8 (only little endian at the moment) and all x86 from sse2 to avx-512. It's open source and MIT licensed. In beta at the moment. Feel free to reach out (for example with an issue) if you are trying out the library.

Take everything with a grain of salt - I don't yet have the arm benchmarks set up.

NOTE: On top of basic all and any we also have a movemask equivalent to do more complex operations like first_true. That wasn't part of the question and it's not amazing but the code can be found here

ARM-V7, 8 bytes register

Now, arm-v7 is 32 bit architecture, so we try to get to 32 bit elements where we can.

  • any

Use pairwise 32 bit max. If any element is true, the max is true.

// cast to dwords
dwords = vpmax_u32(dwords, dwords);
return vget_lane_u32(dwords, 0);
  • all

Pairwise min instead of max. Also what you test against changes. If you have 4 byte element - just test for true. If shorts or chars - you need to test for -1;

// cast to dwords
dwords = vpmin_u32(dwords, dwords);
std::uint32_t combined = vget_lane_u32(dwords, 0);

// Assuming T is your scalar type
if constexpr ( sizeof(T) >= 4 ) return combined;

// I decided that !~ is better than -1, compiler will figure it out.
return !~combined; 

ARM-V7, 16 bytes register

For anything bigger than chars, just do a conversion to a 64 bit one. Here is the list of vector narrow integer conversions.

For chars, the best I found is to reinterpret as uint32 and do an extra check. So compare for == -1 for all and > 0 for any. Seemed nicer asm the split in two 8 byte registers.

Then just do all/any on that dword register.

ARM-v8, 8 byte

ARM-v8 has 64 bit support, so you can just get a 64 bit lane. That one is trivially testable.

ARM-v8, 16 byte

We use vmaxvq_u32 since there is not a 64 bit one for any and vminvq_u32, vminvq_u16 or vminvq_u8 for all depending on the element size. (Which is similar to glibc strlen)

Conclusion

Lack of benchmarks definitely makes me worried, some instructions are problematic sometimes and I don't know about it. Regardless, that's the best I've got, so far at least.



来源:https://stackoverflow.com/questions/31197216/optimizing-horizontal-boolean-reduction-in-arm-neon

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!