Faster alpha blending using a lookup table?

ronag

I would be interested in seeing some benchmarks.

There is an algorithm that can do perfect alpha blending without any floating-point calculations or lookup tables. You can find more info in the following document (the algorithm and code are described at the end).
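For reference, here is a minimal scalar sketch of that idea (the helper names int_mult and prelerp are mine, not the document's): the trick is an exact integer (a*b)/255 with no divide, which the blend then applies per channel to premultiplied pixels.

#include <cstdint>

// Exact (a*b)/255 for bytes, no divide: t = a*b + 0x80; ((t >> 8) + t) >> 8.
static inline uint8_t int_mult(uint8_t a, uint8_t b)
{
    unsigned t = a * b + 0x80u;
    return static_cast<uint8_t>(((t >> 8) + t) >> 8);
}

// PRELERP(S, D) = S + D - (S*D[A])/255 per channel; assumes premultiplied
// inputs, so the sum cannot overflow a byte.
static inline uint8_t prelerp(uint8_t s, uint8_t d, uint8_t dest_alpha)
{
    return static_cast<uint8_t>(s + d - int_mult(s, dest_alpha));
}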

I also did an SSE implementation of this long ago, if you are interested...

#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

void PreOver_SSE2(void* dest, const void* source1, const void* source2, size_t size)
{
    static const size_t STRIDE = sizeof(__m128i)*4;     // one 64-byte cacheline per iteration
    static const uint32_t PSD = 64;                     // prefetch distance, in __m128i units

    static const __m128i round = _mm_set1_epi16(128);
    static const __m128i lomask = _mm_set1_epi32(0x00FF00FF);

    assert(source1 != NULL && source2 != NULL && dest != NULL);
    assert(size % STRIDE == 0);

    const __m128i* source128_1 = reinterpret_cast<const __m128i*>(source1);
    const __m128i* source128_2 = reinterpret_cast<const __m128i*>(source2);
    __m128i*       dest128 = reinterpret_cast<__m128i*>(dest);  

    __m128i d, s, a, rb, ag, t;

    for(size_t k = 0, length = size/STRIDE; k < length; ++k)    
    {
        // TODO: put prefetch between calculations?(R.N)
        _mm_prefetch(reinterpret_cast<const char*>(source128_1+PSD), _MM_HINT_NTA);
        _mm_prefetch(reinterpret_cast<const char*>(source128_2+PSD), _MM_HINT_NTA);

        // work on entire cacheline before next prefetch
        for(int n = 0; n < 4; ++n, ++dest128, ++source128_1, ++source128_2)
        {
            // TODO: assembly optimization use PSHUFD on moves before calculations, lower latency than MOVDQA (R.N) http://software.intel.com/en-us/articles/fast-simd-integer-move-for-the-intel-pentiumr-4-processor/

            // TODO: load entire cacheline at the same time? are there enough registers? 32 bit mode (special compile for 64bit?) (R.N)
            s = _mm_load_si128(source128_1);        // AABBGGRR
            d = _mm_load_si128(source128_2);        // AABBGGRR

            // PRELERP(S, D) = S + D - ((((S*D[A]+0x80)>>8) + (S*D[A]+0x80)) >> 8)
            // T = S*D[A]+0x80  =>  PRELERP(S, D) = S + D - (((T>>8)+T)>>8)

            // set alpha to lo16 from dest_
            a = _mm_srli_epi32(d, 24);          // 000000AA 
            rb = _mm_slli_epi32(a, 16);         // 00AA0000
            a = _mm_or_si128(rb, a);            // 00AA00AA

            rb = _mm_and_si128(lomask, s);      // 00BB00RR     
            rb = _mm_mullo_epi16(rb, a);        // BBBBRRRR 
            rb = _mm_add_epi16(rb, round);      // BBBBRRRR
            t = _mm_srli_epi16(rb, 8);          
            t = _mm_add_epi16(t, rb);
            rb = _mm_srli_epi16(t, 8);          // 00BB00RR 

            ag = _mm_srli_epi16(s, 8);          // 00AA00GG     
            ag = _mm_mullo_epi16(ag, a);        // AAAAGGGG     
            ag = _mm_add_epi16(ag, round);
            t = _mm_srli_epi16(ag, 8);
            t = _mm_add_epi16(t, ag);
            ag = _mm_andnot_si128(lomask, t);   // AA00GG00     

            rb = _mm_or_si128(rb, ag);          // AABBGGRR     pack

            rb = _mm_sub_epi8(s, rb);           // sub S-[(D[A]*S)/255]
            d = _mm_add_epi8(d, rb);            // add D+[S-(D[A]*S)/255]

            _mm_stream_si128(dest128, d);
        }
    }   
    _mm_mfence();   // ensure the last write-combining buffers are flushed to memory
}
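
A minimal usage sketch (sizes and names are hypothetical): both sources and the destination must be 16-byte aligned (the loop uses aligned loads and a streaming store), the byte count must be a multiple of STRIDE (64), and the pixels must be premultiplied; as written, source2 ends up composited over source1.

#include <cstdint>

int main()
{
    alignas(16) static uint32_t src1[1024], src2[1024], dst[1024];
    // ... fill src1 and src2 with premultiplied-alpha AABBGGRR pixels ...
    PreOver_SSE2(dst, src1, src2, sizeof(dst));  // 4096 bytes, a multiple of 64
    return 0;
}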

Today's processors can do a lot of calculation in the time it takes to fetch one value from memory, especially if it's not in the cache. That makes it especially important to benchmark candidate solutions, because you can't easily reason about what the result will be.

I don't know why you're concerned about floating-point conversions; this can all be done with integers.

typedef unsigned char BYTE;   // as in <windows.h>

BYTE r1, r2, rDest;
BYTE g1, g2, gDest;
BYTE b1, b2, bDest;

BYTE av;              // alpha value
int rem = 255 - av;   // remaining fraction

rDest = (r1*rem + r2*av) / 255;
gDest = (g1*rem + g2*av) / 255;
bDest = (b1*rem + b2*av) / 255;

If you want to get really clever, you can replace the divide with a multiply followed by a right-shift.

Edit: Here's the version using a right-shift. Adding a constant might reduce any truncation errors, but I'll leave that as an exercise for the reader.

BYTE r1, r2, rDest;
BYTE g1, g2, gDest;
BYTE b1, b2, bDest;

BYTE av;                        // alpha value
int aMult = 0x10000 * av / 255; // alpha scaled to 16.16 fixed point
int rem = 0x10000 - aMult;      // remaining fraction

rDest = (r1*rem + r2*aMult) >> 16;
gDest = (g1*rem + g2*aMult) >> 16;
bDest = (b1*rem + b2*aMult) >> 16;
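
One way to do that exercise, for the record: add half of the implicit divisor (1 << 15, since the shift is by 16) before shifting, so the result rounds to nearest instead of truncating.

rDest = (r1*rem + r2*aMult + 0x8000) >> 16;
gDest = (g1*rem + g2*aMult + 0x8000) >> 16;
bDest = (b1*rem + b2*aMult + 0x8000) >> 16;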

Mixing colors is computationally trivial. I would be surprised if a lookup table yielded a significant benefit; as always, you must first prove that this calculation is actually a performance bottleneck. And I would suggest that a better solution is to use hardware-accelerated blending.
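
For example, with OpenGL the GPU's blend unit does the whole mix per fragment in hardware (a sketch of the blend-state calls only, assuming an existing GL context rather than a complete program):

glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);  // dst = src*a + dst*(1-a)
// or, for premultiplied alpha as in the SSE answer above:
// glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_ALPHA);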
