Deinterleave and convert float to uint16_t efficiently

Posted by 北城余情 on 2019-12-24 11:36:43

Question


I need to deinterleave a packed image buffer (YUVA) of floats to planar buffers. I would also like to convert these floats to uint16_t, but this is really slow. My question is: How do I speed this up by using intrinsics?

void deinterleave(char* pixels, int rowBytes, char *bufferY, char *bufferU, char *bufferV, char *bufferA)
{
    // Scaling factors (note min. values are actually negative) (limited range)
    const float yuva_factors[4][2] = {
        { 0.07306f, 1.09132f }, // Y
        { 0.57143f, 0.57143f }, // U
        { 0.57143f, 0.57143f }, // V
        { 0.00000f, 1.00000f }  // A
    };

    float *frameBuffer = (float*)pixels;

    // De-Interleave and convert source buffer / bottom first
    for (int r = height - 1, p = 0; r >= 0; r--)
    {
        for (int c = 0; c < width; c++)
        {
            // Get beginning of next block
            const int pos = r * width * 4 + c * 4;

            // VUYA -> YUVA
            ((uint16_t*)bufferY)[p] = (uint16_t)((frameBuffer[pos + 2] + yuva_factors[0][0]) / (yuva_factors[0][0] + yuva_factors[0][1]) * 65535.0f);
            ((uint16_t*)bufferU)[p] = (uint16_t)((frameBuffer[pos + 1] + yuva_factors[1][0]) / (yuva_factors[1][0] + yuva_factors[1][1]) * 65535.0f);
            ((uint16_t*)bufferV)[p] = (uint16_t)((frameBuffer[pos + 0] + yuva_factors[2][0]) / (yuva_factors[2][0] + yuva_factors[2][1]) * 65535.0f);
            ((uint16_t*)bufferA)[p] = (uint16_t)((frameBuffer[pos + 3] + yuva_factors[3][0]) / (yuva_factors[3][0] + yuva_factors[3][1]) * 65535.0f);

            p++;
        }
    }
}

Just to clarify this: I get the "pixels" buffer from this API function ...

// prSuiteError (*GetPixels)(PPixHand inPPixHand, PrPPixBufferAccess inRequestedAccess, char** outPixelAddress);
char *pixels;
ppixSuite->GetPixels(inRenderedFrame, PrPPixBufferAccess_ReadOnly, &pixels);

... and depending on the selected pixel format it can be anything from uint8_t to float. In this use case it will definitely be floats.

My simplified code looks like this:

#include <stdint.h>

static const int width = 1920;
static const int height = 1080;

void unpackFloatToUint16(float* pixels, uint16_t *bufferY, uint16_t *bufferU, uint16_t *bufferV, uint16_t *bufferA)
{
    for (int r = height - 1; r >= 0; r--)
    {
        for (int c = 0; c < (int)width * 4; c += 4)
        {
            const int pos = r * width * 4 + c;

            *bufferV++ = (uint16_t)((pixels[pos] + 0.57143f) * 57342.98164f);
            *bufferU++ = (uint16_t)((pixels[pos + 1] + 0.57143f) * 57342.98164f);
            *bufferY++ = (uint16_t)((pixels[pos + 2] + 0.07306f) * 56283.17216f);
            *bufferA++ = (uint16_t)(pixels[pos + 3] * 65535.0f);
        }
    }
}

Answer 1:


One thing is immediately obvious: replace division with multiply. Throughput should increase by a factor of ~7 to ~15 or so if you were bottlenecked on FP division. (Haswell's divss has one per 7 clock throughput, but mulss is one per 0.5 clocks).

(Assuming you weren't already using -ffast-math to let the compiler replace division by a constant with multiplication by the reciprocal for you).
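
Concretely, that just means folding the divisor into a compile-time multiplier, which the simplified unpackFloatToUint16 above already does. A minimal scalar sketch of the idea (function and constant names are mine):

#include <stdint.h>

// Division by (c0 + c1) replaced with multiplication by a precomputed
// reciprocal; the compiler folds the whole expression into one constant.
static inline uint16_t scale_y(float y)
{
    const float inv_range_y = 65535.0f / (0.07306f + 1.09132f);  // ~56283.17
    return (uint16_t)((y + 0.07306f) * inv_range_y);             // mulss instead of divss
}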


GCC and clang already auto-vectorize your function (at least with compile-time constant height and width; the code in the question doesn't compile because they're not defined, so I didn't know what to assume). See it on the Godbolt compiler explorer. Without -ffast-math, it does use divps for the division, but it does do the math (including conversion to 32-bit integer) with SIMD, with shuffles to group 4 Y values together for a 64-bit store. I don't think it does a very efficient job, but if you were bottlenecked on div throughput then it's probably much better than gcc's.

But with int height = rowBytes >> 1;, clang doesn't auto-vectorize, while gcc still manages to.

It looks like there's room to improve on what the compilers do, though.


Anyway, let's say we want to manually vectorize for AVX + FMA (e.g. Haswell or Steamroller / Ryzen). You can make other versions, too, but since you didn't specify anything about what microarchitecture you want to target (or even that it was x86), I'm going to just do that as an interesting example.


First, we can transform the (Y + c0) / (c0 + c1) * 65535.0f into a single FMA. Distribute the * (1.0f/(c0+c1)) * 65535.0f inside the addition to get (Y * mul_const + add_const), which can be evaluated with a single FMA. We can do this for all 4 components of a pixel at once, with a 128-bit SIMD FMA with two vector constants holding the coefficients in an order matching the layout of the floats in memory. (Or for all 2 pixels at once with a 256-bit FMA).

Unfortunately gcc and clang don't make this optimization with -ffast-math.
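
Here is that algebra as a scalar sketch (my own helper, Y component only; with -mfma, fmaf should compile to a single vfmadd instruction):

#include <math.h>
#include <stdint.h>

// (Y + c0) / (c0 + c1) * 65535  ==  Y * mul + add,
// with mul = 65535 / (c0 + c1) and add = 65535 * c0 / (c0 + c1).
static inline uint16_t scale_y_fma(float y)
{
    const float c0 = 0.07306f, c1 = 1.09132f;
    const float mul = 65535.0f / (c0 + c1);
    const float add = 65535.0f * c0 / (c0 + c1);
    return (uint16_t)fmaf(y, mul, add);  // one FMA, no division
}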

Saving all the shuffling until after converting to integer may work best. It has the advantage that you only need two vectors of FP constants, instead of separate vectors with the coefficient for each component broadcast to all elements. Well I guess you could use FP shuffles on the result of an FMA before converting to integer. (e.g. FMA, then shuffle, then convert a vector of 4 Y values to integer).

FP math is not strictly associative or distributive (because of rounding error) so this may change the result for edge cases. But not necessarily make it worse, just different from your old way of rounding. And converting (Y + const1) * const2 into Y * altconst1 + altconst2 doesn't lose any precision if you do it with an FMA, because FMA doesn't round the internal-temporary product before doing the addition.


So we know how to efficiently do the math and convert to integer (2 CPU instructions for a vector of 8 floats holding 2 pixels). That leaves the shuffling to group Ys together with other Ys, and packing down from 32-bit signed integers to 16-bit unsigned integers. (x86 can only convert between FP and signed integer, until AVX512F which introduces direct FP <-> unsigned (and SIMD for 64-bit integer <-> FP instead of only scalar in 64-bit mode). Anyway, we can't convert directly from float to vectors of 16-bit unsigned integers).

So, given a 128-bit vector of VUYA 32-bit integer elements, our first step could be to narrow to 16-bit integer. x86 has an instruction (SSE4.1 packusdw, intrinsic _mm_packus_epi32) to pack with unsigned saturation (so negative inputs saturate to 0, and large positive inputs saturate to 65535). Presumably this is what you want, instead of truncating the integer which would make overflow wrap around. This takes 2 SIMD vectors as inputs, and produces one output vector, so we'd get VUYAVUYA (for 2 different pixels).

Even if you didn't need the saturation behaviour (e.g. if out-of-range inputs are impossible), packusdw is probably still the most efficient choice for narrowing your integers. Other shuffles only have one input vector, or a fixed shuffle pattern which doesn't throw away the top half of each 32-bit element, so you'd only have 64 bits of useful data in the result after a pshufb or punpck shuffle.
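
As a standalone illustration of that saturation behaviour, here's a toy example of mine (not part of the kernel below; compile with -msse4.1 or higher):

#include <stdio.h>
#include <stdint.h>
#include <smmintrin.h>  // SSE4.1: _mm_packus_epi32

int main(void)
{
    __m128i a = _mm_setr_epi32(-5, 0, 40000, 70000);
    __m128i b = _mm_setr_epi32(65535, 123, -1, 100000);

    // Pack 32-bit signed -> 16-bit with unsigned saturation:
    // negative inputs clamp to 0, inputs above 65535 clamp to 65535.
    // Output order: a0..a3 then b0..b3.
    __m128i p = _mm_packus_epi32(a, b);

    uint16_t out[8];
    _mm_storeu_si128((__m128i*)out, p);
    for (int i = 0; i < 8; i++)
        printf("%u ", (unsigned)out[i]);  // 0 0 40000 65535 65535 123 0 65535
    return 0;
}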

Starting with pack right away is nice to reduce things down to only 2 vectors. I looked at other shuffle orders, but it always took more total shuffles if you start with 32-bit shuffles instead of packing down to 16-bit elements. (See the comments in the godbolt link below)

The 256-bit AVX2 version of vpackusdw operates on the two 128-bit lanes of the 256-bit vector separately, so for pack(ABCD EFGH, IJKL MNOP) you get ABCD IJKL EFGH MNOP. Normally you need another shuffle to put things in the right order. We need further shuffles anyway, but it's still cumbersome. Still, I think you can process twice as much data per loop iteration with only a couple more shuffles in the loop.
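
The usual fix-up for that in-lane behaviour is a single vpermq after the pack. A hedged 256-bit sketch of just the narrowing step (helper name is mine; this isn't the route the 128-bit version below takes):

#include <immintrin.h>

// AVX2: pack two vectors of eight 32-bit ints into one vector of sixteen
// uint16_t in natural order. vpackusdw works per 128-bit lane, producing
// A0..A3 B0..B3 | A4..A7 B4..B7, so a vpermq of the 64-bit quarters in
// order 0,2,1,3 restores A0..A7 B0..B7.
static inline __m256i pack_u32_to_u16_inorder(__m256i a, __m256i b)
{
    __m256i packed = _mm256_packus_epi32(a, b);
    return _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));
}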


Here's what I came up with using 128-bit vectors

source + compiler output on the Godbolt compiler explorer

Note that it doesn't handle the case where the number of pixels isn't a multiple of 4. You could handle that with a cleanup loop (load, scale, and pack with saturation, then extract the four 16-bit components), or with a partially-overlapping final vector of 4 pixels. (No overlap if the number of pixels really is a multiple of 4, otherwise partially overlapping stores into the Y, U, V, and A arrays.) This is easy because it's not operating in-place, so you can re-read the same input after storing output.
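
A per-pixel cleanup loop along those lines could reuse the same helper as the main loop. A sketch (mine, untested, assuming the load_and_scale() function from the listing below):

// Handle the last (numPixels % 4) pixels one at a time: load + scale one
// VUYA pixel, saturate to uint16_t with packus, then extract the 4 words.
static void cleanup_tail(const float *src, unsigned p, unsigned numPixels,
                         uint16_t *V, uint16_t *U, uint16_t *Y, uint16_t *A)
{
    for ( ; p < numPixels; p++) {
        __m128i vuya = load_and_scale(src + p*4);     // one pixel = 4 floats
        __m128i sat  = _mm_packus_epi32(vuya, vuya);  // low 4 words hold V,U,Y,A
        V[p] = (uint16_t)_mm_extract_epi16(sat, 0);
        U[p] = (uint16_t)_mm_extract_epi16(sat, 1);
        Y[p] = (uint16_t)_mm_extract_epi16(sat, 2);
        A[p] = (uint16_t)_mm_extract_epi16(sat, 3);
    }
}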

Also, it assumes that the row stride matches the width, because your code in the question did the same thing. So it doesn't matter if the width is a multiple of 4 pixels or not. But if you did have a variable row-stride separate from width, you'd have to worry about cleanup at the end of each row. (Or use padding so you don't have to).

#include <stdint.h>
#include <immintrin.h>

static const int height = 1024;
static const int width  = 1024;

// helper function for unrolling
static inline __m128i load_and_scale(const float *src)
{  // and convert to 32-bit integer with truncation towards zero.

    // Scaling factors (note min. values are actually negative) (limited range)
    const float yuvaf[4][2] = {
        { 0.07306f, 1.09132f }, // Y
        { 0.57143f, 0.57143f }, // U
        { 0.57143f, 0.57143f }, // V
        { 0.00000f, 1.00000f }  // A
    };

    // (Y + yuvaf[n][0]) / (yuvaf[n][0] + yuvaf[n][1]) ->
    // Y * 1.0f/(yuvaf[n][0] + yuvaf[n][1]) + yuvaf[n][0]/(yuvaf[n][0] + yuvaf[n][1])

    // Pixels are in VUYA order in memory, from low to high address
    const __m128 scale_mul = _mm_setr_ps(
        65535.0f / (yuvaf[2][0] + yuvaf[2][1]),  // V
        65535.0f / (yuvaf[1][0] + yuvaf[1][1]),  // U
        65535.0f / (yuvaf[0][0] + yuvaf[0][1]),  // Y
        65535.0f / (yuvaf[3][0] + yuvaf[3][1])   // A
    );

    const __m128 scale_add = _mm_setr_ps(
        65535.0f * yuvaf[2][0] / (yuvaf[2][0] + yuvaf[2][1]),  // V
        65535.0f * yuvaf[1][0] / (yuvaf[1][0] + yuvaf[1][1]),  // U
        65535.0f * yuvaf[0][0] / (yuvaf[0][0] + yuvaf[0][1]),  // Y
        65535.0f * yuvaf[3][0] / (yuvaf[3][0] + yuvaf[3][1])   // A
    );

    // prefer having src aligned for performance, but with AVX it won't help the compiler much to know it's aligned.
    // So just use an unaligned load intrinsic
    __m128 srcv = _mm_loadu_ps(src);
    __m128 scaled = _mm_fmadd_ps(srcv, scale_mul, scale_add);
    __m128i vuya = _mm_cvttps_epi32(scaled);  // truncate toward zero
    // for round-to-nearest, use cvtps_epi32 instead
    return vuya;
}


void deinterleave_avx_fma(char* __restrict pixels, int rowBytes, char *__restrict bufferY, char *__restrict bufferU, char *__restrict bufferV, char *__restrict bufferA)
{

    const float *src = (float*)pixels;
    uint16_t *__restrict Y = (uint16_t*)bufferY;
    uint16_t *__restrict U = (uint16_t*)bufferU;
    uint16_t *__restrict V = (uint16_t*)bufferV;
    uint16_t *__restrict A = (uint16_t*)bufferA;

    // 4 pixels per loop iteration, loading 4x 16 bytes of floats
    // and storing 4x 8 bytes of uint16_t.
    for (unsigned pos = 0 ; pos < width*height ; pos += 4) {   // pos counts pixels
        // pos*4 because each pixel is 4 floats long; consecutive pixels are 4 floats apart
        __m128i vuya0 = load_and_scale(src+pos*4);
        __m128i vuya1 = load_and_scale(src+pos*4 + 4);
        __m128i vuya2 = load_and_scale(src+pos*4 + 8);
        __m128i vuya3 = load_and_scale(src+pos*4 + 12);

        __m128i vuya02 = _mm_packus_epi32(vuya0, vuya2);  // vuya0 | vuya2
        __m128i vuya13 = _mm_packus_epi32(vuya1, vuya3);  // vuya1 | vuya3
        __m128i vvuuyyaa01 = _mm_unpacklo_epi16(vuya02, vuya13);   // V0V1 U0U1 | Y0Y1 A0A1
        __m128i vvuuyyaa23 = _mm_unpackhi_epi16(vuya02, vuya13);   // V2V3 U2U3 | Y2Y3 A2A3
        __m128i vvvvuuuu = _mm_unpacklo_epi32(vvuuyyaa01, vvuuyyaa23); // v0v1v2v3 | u0u1u2u3
        __m128i yyyyaaaa = _mm_unpackhi_epi32(vvuuyyaa01, vvuuyyaa23);

         // we have 2 vectors holding our four 64-bit results (each 64-bit result is four 16-bit elements)
         // We can most efficiently store with VMOVQ and VMOVHPS, even though MOVHPS is "for" FP vectors
         // Further shuffling of another 4 pixels to get 128b vectors wouldn't be a win:
         // MOVHPS is a pure store on Intel CPUs, no shuffle uops.
         // And we have more shuffles than stores already.

        //_mm_storeu_si64(V+pos, vvvvuuuu);  // clang doesn't have this (AVX512?) intrinsic
        _mm_storel_epi64((__m128i*)(V+pos), vvvvuuuu);               // MOVQ
        _mm_storeh_pi((__m64*)(U+pos), _mm_castsi128_ps(vvvvuuuu));  // MOVHPS

        _mm_storel_epi64((__m128i*)(Y+pos), yyyyaaaa);
        _mm_storeh_pi((__m64*)(A+pos), _mm_castsi128_ps(yyyyaaaa));
    }
}
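
For completeness, a minimal harness to drive it might look like this (my own addition, appended to the same file as the listing above; 32-byte-aligned buffers via _mm_malloc, although the kernel only uses unaligned loads and stores; compile with e.g. -mavx -mfma):

#include <stddef.h>

int main(void)
{
    const size_t npix = (size_t)width * height;
    float    *pixels = (float*)_mm_malloc(npix * 4 * sizeof(float), 32);
    uint16_t *Y = (uint16_t*)_mm_malloc(npix * sizeof(uint16_t), 32);
    uint16_t *U = (uint16_t*)_mm_malloc(npix * sizeof(uint16_t), 32);
    uint16_t *V = (uint16_t*)_mm_malloc(npix * sizeof(uint16_t), 32);
    uint16_t *A = (uint16_t*)_mm_malloc(npix * sizeof(uint16_t), 32);

    // ... fill pixels with interleaved VUYA floats (e.g. from GetPixels) ...

    deinterleave_avx_fma((char*)pixels, width * 4 * (int)sizeof(float),
                         (char*)Y, (char*)U, (char*)V, (char*)A);

    _mm_free(pixels); _mm_free(Y); _mm_free(U); _mm_free(V); _mm_free(A);
    return 0;
}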

Hopefully the variable names + comments make the shuffles fairly human-readable. This is untested; the most likely bug would be wrong ordering of some vectors as arguments to a shuffle. But fixing it should just be a matter of reversing an arg order or something, not needing extra shuffles that would slow it down.

It looks like 6 shuffles including the packing is about the best I can do. It includes basically a 4x4 transpose to go from vuya x4 to vvvv uuuu yyyy aaaa, and pack has a fixed shuffle pattern which doesn't help with the deinterleave, so I don't think we can do any better than this with 128-bit vectors. It's always possible I overlooked something, of course.


gcc and clang both compile it slightly sub-optimally:

Clang uses vpextrq instead of vmovhps (costing an extra 2 shuffle uops total per loop iteration, on Intel CPUs). And also uses 2 separate loop counters instead of scaling the same counter by 1 or by 8, so that costs 1 extra integer add instruction, for no benefit. (If only gcc had chosen to do that instead of using indexed loads folded into the FMAs... silly compilers.)

gcc, instead of using vmovups loads, deals with FMA3 destroying one of its inputs by copying vector constants and then using a memory operand with an indexed addressing mode. This doesn't stay micro-fused, so it's 4 extra total uops for the front-end.

If it compiled perfectly, with just one loop counter used as an array index for the float source (scaled by *8) and the integer destination arrays (unscaled) like gcc does, and if gcc did the loads the way clang does, then the whole loop would be 24 fused-domain uops on Haswell/Skylake.

So it could issue from the front-end at one iteration per 6 clocks, right at the limit of 4 fused-domain uops per clock. It contains 6 shuffle uops, so it would also be right up against a port5 throughput bottleneck. (HSW / SKL only have 1 shuffle execution unit). So in theory, this can run at 4 pixels (16 floats) per 6 clocks, or one pixel per 1.5 clock cycles on Intel CPUs. Maybe slightly more on Ryzen, where MOVHPS costs multiple uops but the pipeline is wider. (See http://agner.org/optimize/)

4 loads and 4 stores per 6 clocks is nice and far away from any bottleneck, except for possibly memory bandwidth if your src and dst aren't hot in cache. The stores are only half width, and everything is sequential. 4 separate output streams from one loop is few enough that it shouldn't usually be a problem. (If you had more than 4 output streams, you might consider fissioning the loop and only storing the U and V values on one pass, and then only the Y and A values on another pass over the same source data, or something. But like I said, 4 output streams is fine and doesn't warrant loop fission.)


A 256-bit version of this might take more than just 2 extra vpermq at the end, because I don't think you can easily work around the fact that two V values you want adjacent in bufferV are stuck in the high and low lane of the same __m256i vector. So you might need an extra 4 vpermd or vperm2i128 shuffles early in the process, because the smallest granularity lane-crossing shuffle is 32-bit granularity. This could hurt Ryzen a lot.

Maybe you could do something using vpblendw to re-arrange word elements between vectors after grouping 4 or 8 V values together, but the wrong V values.

AVX512 doesn't have the same in-lane design for most shuffles, so an AVX512 version of this would be cheaper, I think. AVX512 has narrowing saturate / truncate instructions, but those instructions are slower than vpackusdw on Skylake-X, so maybe not.
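
One of those AVX512 narrowing instructions is vpmovusdw; a hedged sketch of the drop-in (helper name is mine, and I haven't benchmarked it):

#include <immintrin.h>

// AVX512F: narrow sixteen 32-bit integers to sixteen uint16_t with unsigned
// saturation, keeping elements in order (no lane fix-up needed).
// Caveat: the source is treated as unsigned, so negative 32-bit inputs
// saturate to 65535 here, not to 0 as with packusdw.
static inline __m256i narrow_u32_to_u16_avx512(__m512i v)
{
    return _mm512_cvtusepi32_epi16(v);  // vpmovusdw
}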



Source: https://stackoverflow.com/questions/48151874/deinterleave-and-convert-float-to-uint16-t-efficiently
