I\'m trying to perform image colour conversion from YCbCr to BGRA (Don\'t ask about the A bit, such a headache).
Anyway, this needs to perform as fast as possible, so I\
You may be bandwidth limited here, as there is very little computation relative to the number of loads and stores.
One suggestion: get rid of the _mm_prefetch
intrinsics - they are almost certainly not helping and may even hinder operation on more recent CPUs (which already do a pretty good job with automatic prefetching).
Another area to look at:
__m128i sK = _mm_set_epi32(m_pKBuffer[i], m_pKBuffer[i+1], m_pKBuffer[i+2], m_pKBuffer[i+3]);
__m128i sY = _mm_set_epi32(pSrc8u[0][i], pSrc8u[0][i+1], pSrc8u[0][i+2], pSrc8u[0][i+3]);
__m128i sU = _mm_set_epi32((char)pSrc8u[1][i], (char)pSrc8u[1][i+1], (char)pSrc8u[1][i+2], (char)pSrc8u[1][i+3]);
__m128i sV = _mm_set_epi32((char)pSrc8u[2][i], (char)pSrc8u[2][i+1], (char)pSrc8u[2][i+2], (char)pSrc8u[2][i+3]);
This is generating a lot of unnecessary instructions - you should be using _mm_load_xxx
and _mm_unpackxx_xxx
here. It will look like more code, but it will be a lot more efficient. And you should probably be processing 16 pixels per iteration of the loop, rather than 4 - that way you load a vector of 8 bit values once, and unpack to get each set of 4 values as a vector of 32 bit ints as needed.