Debug data/neon performance hazards in arm neon code

Submitted by Deadly on 2019-12-04 05:03:15

Question


Originally the problem appeared when I tried to optimize an algorithm for ARM NEON, and a minor part of it was taking 80% of the time according to the profiler. To test what could be done to improve it, I created an array of function pointers to different versions of my optimized function and then ran them in a loop to see in the profiler which one performs better:

typedef unsigned(*CalcMaxFunc)(const uint16_t a[8][4], const uint16_t b[4][4]);
CalcMaxFunc CalcMaxFuncs[] =
{
    CalcMaxFunc_NEON_0,
    CalcMaxFunc_NEON_1,
    CalcMaxFunc_NEON_2,
    CalcMaxFunc_NEON_3,
    CalcMaxFunc_C_0
};


unsigned ret = 0;
int N = sizeof(CalcMaxFuncs) / sizeof(CalcMaxFuncs[0]);
for (int i = 0; i < 10 * N; ++i)
{
    auto f = CalcMaxFuncs[i % N];
    unsigned retI = f(a, b);

    // just random code to ensure that cpu waits for the results
    // and compiler doesn't optimize it away
    if (retI > 1000000)
        break;
    ret |= retI;
}

I got surprising results: the performance of a function depended entirely on its position within the CalcMaxFuncs array. For example, when I swapped CalcMaxFunc_NEON_3 to be first, it would be 3-4 times slower, and according to the profiler it would stall at the last bit of the function, where it tried to move data from a NEON register to an ARM register.

So, what makes it stall sometimes and not at other times? By the way, I profile on an iPhone 6 in Xcode, if that matters.

When I intentionally introduced NEON pipeline stalls by mixing in some floating-point division between the calls to these functions in the loop, the unreliable behavior went away: now all of them perform the same regardless of the order in which they are called. So why did I have that problem in the first place, and what can I do to eliminate it in actual code?
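
For reference, a minimal sketch of what that "mix in a floating-point division" experiment could look like. The volatile divisor and the way the result is kept live are my additions, not the author's exact code; the point is only that a long-latency scalar FP operation sits between consecutive calls and drains the NEON pipeline:

volatile float denom = 3.0f; // volatile so the divisions are not folded away
float sink = 1e9f;
unsigned ret = 0;
for (int i = 0; i < 10 * N; ++i)
{
    auto f = CalcMaxFuncs[i % N];
    unsigned retI = f(a, b);
    if (retI > 1000000)
        break;
    ret |= retI;
    sink /= denom; // long-latency FP division between the calls under test
}
ret |= (unsigned)sink; // keep sink live so the divisions are not optimized out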

Update: I created a simple test function and then optimized it in stages to see how I could avoid NEON->ARM stalls. Here's the test-runner function:

#include <cstdint> // uint8_t
#include <cstdlib> // srand, rand, exit

void NeonStallTest()
{
    int findMinErr(uint8_t* var1, uint8_t* var2, int size);
    srand(0);
    uint8_t var1[1280];
    uint8_t var2[1280];
    for (int i = 0; i < sizeof(var1); ++i)
    {
        var1[i] = rand();
        var2[i] = rand();
    }
#if 0 // early exit?
    for (int i = 0; i < 16; ++i)
        var1[i] = var2[i];
#endif
    int ret = 0;
    for (int i=0; i<10000000; ++i)
        ret += findMinErr(var1, var2, sizeof(var1));
    exit(ret);
}

And findMinErr is this:

#include <climits> // INT_MAX

int findMinErr(uint8_t* var1, uint8_t* var2, int size)
{
    int ret = 0;
    int ret_err = INT_MAX;
    for (int i = 0; i < size / 16; ++i, var1 += 16, var2 += 16)
    {
        int err = 0;
        for (int j = 0; j < 16; ++j)
        {
            int x = var1[j] - var2[j];
            err += x * x;
        }
        if (ret_err > err)
        {
            ret_err = err;
            ret = i;
        }
    }
    return ret;
}

Basically it computes the sum of squared differences between each uint8_t[16] block and returns the index of the block pair that has the lowest squared difference. Then I rewrote it in NEON intrinsics (no particular attempt was made to make it fast, as that's not the point):

#include <arm_neon.h>

int findMinErr_NEON(uint8_t* var1, uint8_t* var2, int size)
{
    int ret = 0;
    int ret_err = INT_MAX;
    for (int i = 0; i < size / 16; ++i, var1 += 16, var2 += 16)
    {
        int err;
        uint8x8_t var1_0 = vld1_u8(var1 + 0);
        uint8x8_t var1_1 = vld1_u8(var1 + 8);
        uint8x8_t var2_0 = vld1_u8(var2 + 0);
        uint8x8_t var2_1 = vld1_u8(var2 + 8);
        int16x8_t s0 = vreinterpretq_s16_u16(vsubl_u8(var1_0, var2_0));
        int16x8_t s1 = vreinterpretq_s16_u16(vsubl_u8(var1_1, var2_1));
        uint16x8_t u0 = vreinterpretq_u16_s16(vmulq_s16(s0, s0));
        uint16x8_t u1 = vreinterpretq_u16_s16(vmulq_s16(s1, s1));
#ifdef __aarch64__1 // never defined, so the #else path below is always taken
        err = vaddlvq_u16(u0) + vaddlvq_u16(u1);
#else
        uint32x4_t err0 = vpaddlq_u16(u0);
        uint32x4_t err1 = vpaddlq_u16(u1);
        err0 = vaddq_u32(err0, err1);
        uint32x2_t err00 = vpadd_u32(vget_low_u32(err0), vget_high_u32(err0));
        err00 = vpadd_u32(err00, err00);
        err = vget_lane_u32(err00, 0);
#endif

        if (ret_err > err)
        {
            ret_err = err;
            ret = i;
#if 0 // enable early exit?
            if (ret_err == 0)
                break;
#endif
        }
    }
    return ret;
}

Now, if (ret_err > err) is clearly a data hazard. So I manually "unrolled" the loop by two and modified the code to use err0 and err1 and check them only after performing the next round of compute. According to the profiler this brought some improvement. In the simple NEON loop, roughly 30% of the entire function was spent in the two lines vget_lane_u32 followed by if (ret_err > err); after unrolling by two, these operations took 25% (i.e. roughly a 10% overall speedup). Also, checking the armv7 version, there are only 8 instructions between where err0 is set (vmov.32 r6, d16[0]) and where it is accessed (cmp r12, r6).
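
A sketch of one possible shape of that 2x unroll follows. The exact scheduling the author used isn't shown, so the details, including the function name, are hypothetical; it assumes size is a multiple of 32:

int findMinErr_NEON_2(uint8_t* var1, uint8_t* var2, int size)
{
    int ret = 0;
    int ret_err = INT_MAX;
    for (int i = 0; i < size / 32; ++i, var1 += 32, var2 += 32)
    {
        // first 16-byte block: same arithmetic as findMinErr_NEON above
        int16x8_t s0 = vreinterpretq_s16_u16(vsubl_u8(vld1_u8(var1 + 0), vld1_u8(var2 + 0)));
        int16x8_t s1 = vreinterpretq_s16_u16(vsubl_u8(vld1_u8(var1 + 8), vld1_u8(var2 + 8)));
        uint32x4_t e0 = vaddq_u32(vpaddlq_u16(vreinterpretq_u16_s16(vmulq_s16(s0, s0))),
                                  vpaddlq_u16(vreinterpretq_u16_s16(vmulq_s16(s1, s1))));
        uint32x2_t e00 = vpadd_u32(vget_low_u32(e0), vget_high_u32(e0));
        e00 = vpadd_u32(e00, e00);
        int err0 = vget_lane_u32(e00, 0); // NEON->ARM move for block 2*i

        // second 16-byte block is computed before err0 is compared,
        // giving the first move time to retire
        int16x8_t s2 = vreinterpretq_s16_u16(vsubl_u8(vld1_u8(var1 + 16), vld1_u8(var2 + 16)));
        int16x8_t s3 = vreinterpretq_s16_u16(vsubl_u8(vld1_u8(var1 + 24), vld1_u8(var2 + 24)));
        uint32x4_t e1 = vaddq_u32(vpaddlq_u16(vreinterpretq_u16_s16(vmulq_s16(s2, s2))),
                                  vpaddlq_u16(vreinterpretq_u16_s16(vmulq_s16(s3, s3))));
        uint32x2_t e10 = vpadd_u32(vget_low_u32(e1), vget_high_u32(e1));
        e10 = vpadd_u32(e10, e10);
        int err1 = vget_lane_u32(e10, 0); // NEON->ARM move for block 2*i+1

        // the ARM-side checks happen only after both blocks' NEON work was issued
        if (ret_err > err0) { ret_err = err0; ret = 2 * i; }
        if (ret_err > err1) { ret_err = err1; ret = 2 * i + 1; }
    }
    return ret;
}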

Note that in the code the early exit is #ifdef'ed out; enabling it made the function even slower. When I unrolled by four, changed the code to use four errN variables, and deferred the check by two rounds, I still saw vget_lane_u32 taking too much time in the profiler. Checking the generated asm, it appears the compiler defeats these optimization attempts because it reuses some of the errN registers, which effectively makes the CPU access the results of vget_lane_u32 much earlier than I want (I aim to delay the access by 10-20 instructions).

Only when I unrolled by four and marked all four errN as volatile did vget_lane_u32 disappear entirely from the profiler; however, the if (ret_err > errN) checks obviously got slow as hell, as the variables probably ended up as regular stack variables, and overall those four checks in the 4x manual unroll started to take 40%. It looks like with proper manual asm it is possible to make this work: have an early loop exit while avoiding NEON->ARM stalls and keeping some ARM logic in the loop. However, the extra maintenance required to deal with ARM asm makes that kind of code about 10x more complex to maintain in a large project (one that doesn't have any arm asm).

Update:

Here's a sample stall when moving data from a NEON register to an ARM register. To implement the early exit I need to move from NEON to ARM once per loop iteration. This move alone takes more than 50% of the entire function according to the sampling profiler that comes with Xcode. I tried to add lots of no-ops before and/or after the mov, but nothing seems to affect the results in the profiler. I tried using vorr d0,d0,d0 as the no-op: no difference. What's the reason for the stall, or is the profiler simply showing wrong results?
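
For what it's worth, one way to inject such vorr no-ops from C is GCC/Clang inline asm; the macro below is my own hypothetical helper, not something from the original post, and it applies only to the 32-bit armv7 path:

// hypothetical helper: emits the "vorr d0, d0, d0" filler mentioned above
#define NEON_NOP() asm volatile("vorr d0, d0, d0" ::: "d0")

// usage sketch inside the loop, padding around the NEON->ARM move
NEON_NOP(); NEON_NOP(); NEON_NOP(); NEON_NOP();
int err = vget_lane_u32(err00, 0); // the move that shows up as the stall
NEON_NOP(); NEON_NOP(); NEON_NOP(); NEON_NOP();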

Source: https://stackoverflow.com/questions/49951100/debug-data-neon-performance-hazards-in-arm-neon-code
