发表新帖

发表新帖

Speeding up some SSE2 Intrinsics for color conversion

前端未结

关注

 1  1040

I\'m trying to perform image colour conversion from YCbCr to BGRA (Don\'t ask about the A bit, such a headache).

Anyway, this needs to perform as fast as possible, so I\

相关标签:

1条回答

[愿得一人]

2021-02-03 16:18
You may be bandwidth limited here, as there is very little computation relative to the number of loads and stores.

One suggestion: get rid of the _mm_prefetch intrinsics - they are almost certainly not helping and may even hinder operation on more recent CPUs (which already do a pretty good job with automatic prefetching).

Another area to look at:
```
__m128i sK = _mm_set_epi32(m_pKBuffer[i],           m_pKBuffer[i+1],            m_pKBuffer[i+2],            m_pKBuffer[i+3]);
__m128i sY = _mm_set_epi32(pSrc8u[0][i],            pSrc8u[0][i+1],             pSrc8u[0][i+2],             pSrc8u[0][i+3]);
__m128i sU = _mm_set_epi32((char)pSrc8u[1][i],      (char)pSrc8u[1][i+1],       (char)pSrc8u[1][i+2],       (char)pSrc8u[1][i+3]);
__m128i sV = _mm_set_epi32((char)pSrc8u[2][i],      (char)pSrc8u[2][i+1],       (char)pSrc8u[2][i+2],       (char)pSrc8u[2][i+3]);
```
This is generating a lot of unnecessary instructions - you should be using _mm_load_xxx and _mm_unpackxx_xxx here. It will look like more code, but it will be a lot more efficient. And you should probably be processing 16 pixels per iteration of the loop, rather than 4 - that way you load a vector of 8 bit values once, and unpack to get each set of 4 values as a vector of 32 bit ints as needed.
0 讨论(0)
发布评论:

提交评论
- 加载中...

热议问题