Speeding up some SSE2 Intrinsics for color conversion

北海茫月 2021-02-03 15:38

I'm trying to perform image colour conversion from YCbCr to BGRA (don't ask about the A bit, such a headache).

Anyway, this needs to perform as fast as possible, so I\

1 answer
  • 2021-02-03 16:18

    You may be bandwidth limited here, as there is very little computation relative to the number of loads and stores.

    One suggestion: get rid of the _mm_prefetch intrinsics - they are almost certainly not helping and may even hinder operation on more recent CPUs (which already do a pretty good job with automatic prefetching).

    Another area to look at:

    __m128i sK = _mm_set_epi32(m_pKBuffer[i],           m_pKBuffer[i+1],            m_pKBuffer[i+2],            m_pKBuffer[i+3]);
    __m128i sY = _mm_set_epi32(pSrc8u[0][i],            pSrc8u[0][i+1],             pSrc8u[0][i+2],             pSrc8u[0][i+3]);
    __m128i sU = _mm_set_epi32((char)pSrc8u[1][i],      (char)pSrc8u[1][i+1],       (char)pSrc8u[1][i+2],       (char)pSrc8u[1][i+3]);
    __m128i sV = _mm_set_epi32((char)pSrc8u[2][i],      (char)pSrc8u[2][i+1],       (char)pSrc8u[2][i+2],       (char)pSrc8u[2][i+3]);
    

    This is generating a lot of unnecessary instructions - you should be using _mm_load_xxx and _mm_unpackxx_xxx here. It will look like more code, but it will be a lot more efficient. And you should probably be processing 16 pixels per iteration of the loop, rather than 4 - that way you load a vector of 8 bit values once, and unpack to get each set of 4 values as a vector of 32 bit ints as needed.
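    As a rough illustration of the load-and-unpack idea, here is a minimal sketch for one 8-bit plane. The helper names widen_u8_to_i32 and widen_s8_to_i32 are hypothetical (they are not from the question), and unaligned loads are used because the alignment of the source buffers is not shown. The first helper zero-extends, as the Y line does; the second sign-extends, which is what the (char) casts on U and V do when char is signed. The element type of m_pKBuffer is not shown, so it is left out.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: zero-extend 16 unsigned bytes into four vectors
   of 32-bit ints. out[0] holds src[0..3], out[1] holds src[4..7], etc. */
static inline void widen_u8_to_i32(const uint8_t *src, __m128i out[4])
{
    const __m128i zero = _mm_setzero_si128();
    __m128i v  = _mm_loadu_si128((const __m128i *)src); /* one 16-byte load */
    __m128i lo = _mm_unpacklo_epi8(v, zero);  /* bytes 0..7  -> 16-bit lanes */
    __m128i hi = _mm_unpackhi_epi8(v, zero);  /* bytes 8..15 -> 16-bit lanes */
    out[0] = _mm_unpacklo_epi16(lo, zero);    /* 16-bit -> 32-bit lanes */
    out[1] = _mm_unpackhi_epi16(lo, zero);
    out[2] = _mm_unpacklo_epi16(hi, zero);
    out[3] = _mm_unpackhi_epi16(hi, zero);
}

/* Hypothetical helper: sign-extend 16 bytes into four vectors of 32-bit
   ints. Unpacking a vector with itself duplicates each element into the
   upper half of the wider lane; an arithmetic right shift then leaves the
   sign-extended value. */
static inline void widen_s8_to_i32(const uint8_t *src, __m128i out[4])
{
    __m128i v  = _mm_loadu_si128((const __m128i *)src);
    __m128i lo = _mm_srai_epi16(_mm_unpacklo_epi8(v, v), 8);
    __m128i hi = _mm_srai_epi16(_mm_unpackhi_epi8(v, v), 8);
    out[0] = _mm_srai_epi32(_mm_unpacklo_epi16(lo, lo), 16);
    out[1] = _mm_srai_epi32(_mm_unpackhi_epi16(lo, lo), 16);
    out[2] = _mm_srai_epi32(_mm_unpacklo_epi16(hi, hi), 16);
    out[3] = _mm_srai_epi32(_mm_unpackhi_epi16(hi, hi), 16);
}
```

    One thing to watch: after a load, element i sits in the lowest lane, whereas _mm_set_epi32(a, b, c, d) puts its first argument in the highest lane. Any later shuffles or stores that relied on the reversed order would need adjusting.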
