I wrote an algorithm to convert an RGB image to YUV420. I spent a long time trying to make it faster, but I haven't found any other way to boost its efficiency, so now I turn to
The only obvious point I can see is that you're doing 3 * i three times. You could store that result in a local variable, but the compiler may well already be doing that. So:
r = rgb + 3 * i;
g = rgb + 3 * i + 1;
b = rgb + 3 * i + 2;
...becomes:
r = rgb + 3 * i;
g = r + 1;
b = g + 1;
...although I doubt it would have much impact.
As ciphor suggests, I think assembly is the only way you're likely to improve upon what you've got there.
Unroll your loop and get rid of the if in the inner loop. But do not run over the image data three times; a single pass is even faster!
#include <stddef.h>
#include <stdint.h>

// Single pass over the RGB data. Assumes width and height are even and that
// destination holds width * height * 3 / 2 bytes.
void Bitmap2Yuv420p_calc2(uint8_t *destination, uint8_t *rgb, size_t width, size_t height)
{
    size_t image_size = width * height;
    size_t upos = image_size;        // start of the U plane
    size_t vpos = upos + upos / 4;   // start of the V plane
    size_t i = 0;                    // index of the current pixel

    for( size_t line = 0; line < height; ++line )
    {
        if( !(line % 2) )
        {
            // Even line: Y for every pixel, U and V for every second pixel.
            for( size_t x = 0; x < width; x += 2 )
            {
                uint8_t r = rgb[3 * i];
                uint8_t g = rgb[3 * i + 1];
                uint8_t b = rgb[3 * i + 2];

                destination[i++] = ((66*r + 129*g + 25*b) >> 8) + 16;

                destination[upos++] = ((-38*r + -74*g + 112*b) >> 8) + 128;
                destination[vpos++] = ((112*r + -94*g + -18*b) >> 8) + 128;

                r = rgb[3 * i];
                g = rgb[3 * i + 1];
                b = rgb[3 * i + 2];

                destination[i++] = ((66*r + 129*g + 25*b) >> 8) + 16;
            }
        }
        else
        {
            // Odd line: Y only.
            for( size_t x = 0; x < width; x += 1 )
            {
                uint8_t r = rgb[3 * i];
                uint8_t g = rgb[3 * i + 1];
                uint8_t b = rgb[3 * i + 2];

                destination[i++] = ((66*r + 129*g + 25*b) >> 8) + 16;
            }
        }
    }
}
In my tests, this was about 25% faster than the answer you accepted (VS 2010; the exact gain depends on whether the build targets x86 or x64).
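When comparing variants like these, it helps to check that the faster version still produces the same values. One way is to isolate the fixed-point formulas used throughout this thread into a per-pixel helper. The sketch below is only an illustration: the names `Yuv` and `rgb_to_yuv` are hypothetical and appear in none of the answers, and it assumes that `>>` on a negative `int` is an arithmetic shift (guaranteed since C++20, implementation-defined before that).

```cpp
#include <cstdint>

// Hypothetical helper type, not part of the original code.
struct Yuv {
    std::uint8_t y, u, v;
};

// Applies the same integer formulas as the conversion loops above
// to a single pixel.
// Assumption: >> on a negative int is an arithmetic shift.
inline Yuv rgb_to_yuv(std::uint8_t r, std::uint8_t g, std::uint8_t b)
{
    Yuv out;
    out.y = static_cast<std::uint8_t>((( 66 * r + 129 * g +  25 * b) >> 8) +  16);
    out.u = static_cast<std::uint8_t>(((-38 * r -  74 * g + 112 * b) >> 8) + 128);
    out.v = static_cast<std::uint8_t>(((112 * r -  94 * g -  18 * b) >> 8) + 128);
    return out;
}
```

With these formulas, black (0, 0, 0) maps to (16, 128, 128) and white (255, 255, 255) maps to (235, 128, 128), so a handful of known colours is enough for a quick regression check.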
I guess the lookup tables are superfluous. The respective multiplications should be faster than a memory access, especially in such an inner loop.
Then, I would also apply some small changes (as others have already suggested):
#include <stddef.h>
#include <boost/cstdint.hpp>

void Bitmap2Yuv420p( boost::uint8_t *destination, boost::uint8_t *rgb,
                     const int &width, const int &height ) {
    const size_t image_size = width * height;
    // upos and vpos are incremented below, so they must not be const.
    size_t upos = image_size;
    size_t vpos = upos + upos / 4;

    for( size_t i = 0; i < image_size; ++i ) {
        boost::uint8_t r = rgb[3*i  ];
        boost::uint8_t g = rgb[3*i+1];
        boost::uint8_t b = rgb[3*i+2];

        destination[i] = ( ( 66*r + 129*g + 25*b ) >> 8 ) + 16;

        // Sample U and V only on even rows and even columns.
        if (!((i / width) % 2) && !(i % 2)) {
            destination[upos++] = ( ( -38*r + -74*g + 112*b ) >> 8) + 128;
            destination[vpos++] = ( ( 112*r + -94*g + -18*b ) >> 8) + 128;
        }
    }
}
EDIT
You should also rearrange the code so that you can remove the if(). Small, simple inner loops without branches are fast. Here, it may be a good idea to write the Y plane first, then the U and V planes, like this:
void Bitmap2Yuv420p( boost::uint8_t *destination, boost::uint8_t *rgb,
                     const int &width, const int &height ) {
    const size_t image_size = width * height;
    boost::uint8_t *dst_y = destination;
    boost::uint8_t *dst_u = destination + image_size;
    boost::uint8_t *dst_v = destination + image_size + image_size/4;

    // Y plane
    for( size_t i = 0; i < image_size; ++i ) {
        *dst_y++ = ( ( 66*rgb[3*i] + 129*rgb[3*i+1] + 25*rgb[3*i+2] ) >> 8 ) + 16;
    }

#if 1
    // U plane
    for( size_t y=0; y<height; y+=2 ) {
        for( size_t x=0; x<width; x+=2 ) {
            const size_t i = y*width + x;
            *dst_u++ = ( ( -38*rgb[3*i] + -74*rgb[3*i+1] + 112*rgb[3*i+2] ) >> 8 ) + 128;
        }
    }
    // V plane
    for( size_t y=0; y<height; y+=2 ) {
        for( size_t x=0; x<width; x+=2 ) {
            const size_t i = y*width + x;
            *dst_v++ = ( ( 112*rgb[3*i] + -94*rgb[3*i+1] + -18*rgb[3*i+2] ) >> 8 ) + 128;
        }
    }
#else // also try this version:
    // U+V planes
    for( size_t y=0; y<height; y+=2 ) {
        for( size_t x=0; x<width; x+=2 ) {
            const size_t i = y*width + x;
            *dst_u++ = ( ( -38*rgb[3*i] + -74*rgb[3*i+1] + 112*rgb[3*i+2] ) >> 8 ) + 128;
            *dst_v++ = ( ( 112*rgb[3*i] + -94*rgb[3*i+1] + -18*rgb[3*i+2] ) >> 8 ) + 128;
        }
    }
#endif
}
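All of the versions above write into one contiguous buffer: the full-resolution Y plane first, then the quarter-size U and V planes. A minimal sketch of that layout follows, assuming even width and height; `Yuv420pLayout` and `yuv420p_layout` are hypothetical names introduced here purely for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of where each plane starts in the destination buffer.
struct Yuv420pLayout {
    std::size_t y_offset;    // start of the Y plane
    std::size_t u_offset;    // start of the U plane
    std::size_t v_offset;    // start of the V plane
    std::size_t total_size;  // bytes the destination buffer must hold
};

// Computes the plane offsets used by the conversion functions above.
// Assumption: width and height are both even.
inline Yuv420pLayout yuv420p_layout(std::size_t width, std::size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    const std::size_t image_size = width * height;

    Yuv420pLayout layout;
    layout.y_offset   = 0;
    layout.u_offset   = image_size;
    layout.v_offset   = image_size + image_size / 4;
    layout.total_size = image_size + image_size / 2;
    return layout;
}
```

For a 640x480 image this gives a 460800-byte buffer: 307200 bytes of Y followed by 76800 bytes each of U and V.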
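Do not read through a pointer more than once: copy the value to a local variable (on the stack) and then use that copy. Otherwise the compiler may have to reload the value after every store, because it cannot rule out that the source and destination pointers alias.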
...
int v_r = *r;
int v_g = *g;
int v_b = *b;
*y = ((lookup66[v_r] + lookup129[v_g] + lookup25[v_b]) >> 8) + 16;
...
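To make the point concrete, here is a sketch of one pixel's worth of work with the inputs copied into locals first. It computes the products directly instead of using the question's lookup tables (which are not shown in this thread), and the function name `convert_pixel` and its parameters are hypothetical.

```cpp
#include <cstdint>

// Hypothetical per-pixel routine. r, g, b point at the source channels;
// y, u, v point at the output samples.
inline void convert_pixel(const std::uint8_t *r, const std::uint8_t *g,
                          const std::uint8_t *b,
                          std::uint8_t *y, std::uint8_t *u, std::uint8_t *v)
{
    // Read each input exactly once. All pointers are byte pointers, so the
    // compiler must assume that a store through y, u or v could modify
    // *r, *g or *b. With local copies there is nothing left to reload.
    const int v_r = *r;
    const int v_g = *g;
    const int v_b = *b;

    // Assumption: >> on a negative int is an arithmetic shift.
    *y = static_cast<std::uint8_t>((( 66 * v_r + 129 * v_g +  25 * v_b) >> 8) +  16);
    *u = static_cast<std::uint8_t>(((-38 * v_r -  74 * v_g + 112 * v_b) >> 8) + 128);
    *v = static_cast<std::uint8_t>(((112 * v_r -  94 * v_g -  18 * v_b) >> 8) + 128);
}
```

Whether this changes the generated code depends on the compiler and on how the surrounding loop is written, so it is worth measuring rather than assuming.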
On the other hand, you can do it in SSE without look-up tables and process 8 pixels at once.
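As a rough idea of what that could look like, the sketch below computes 8 luma samples at once with SSE2 intrinsics. It is an illustration under several assumptions, not a tuned implementation: the helper name `luma8` is hypothetical, the RGB bytes are deinterleaved in plain scalar code to keep it short, only the Y formula is shown, and a scalar fallback is used when SSE2 is not available at compile time.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define YUV_SKETCH_HAVE_SSE2 1
#else
#define YUV_SKETCH_HAVE_SSE2 0
#endif

// Hypothetical helper: computes 8 luma (Y) samples from 8 packed RGB pixels
// (24 bytes) and stores them in y_out[0..7].
inline void luma8(const std::uint8_t *rgb, std::uint8_t *y_out)
{
#if YUV_SKETCH_HAVE_SSE2
    // Deinterleave in scalar code to keep the sketch short; a tuned version
    // would shuffle the bytes with SIMD instructions as well.
    std::uint16_t r[8], g[8], b[8];
    for (int k = 0; k < 8; ++k) {
        r[k] = rgb[3 * k];
        g[k] = rgb[3 * k + 1];
        b[k] = rgb[3 * k + 2];
    }
    const __m128i vr = _mm_loadu_si128(reinterpret_cast<const __m128i *>(r));
    const __m128i vg = _mm_loadu_si128(reinterpret_cast<const __m128i *>(g));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b));

    // 66*r + 129*g + 25*b is at most 220 * 255 = 56100, so the sum fits in
    // an unsigned 16-bit lane and the low halves of the products suffice.
    __m128i sum = _mm_mullo_epi16(vr, _mm_set1_epi16(66));
    sum = _mm_add_epi16(sum, _mm_mullo_epi16(vg, _mm_set1_epi16(129)));
    sum = _mm_add_epi16(sum, _mm_mullo_epi16(vb, _mm_set1_epi16(25)));

    // Logical shift, because the lanes hold unsigned values here.
    const __m128i y16 = _mm_add_epi16(_mm_srli_epi16(sum, 8), _mm_set1_epi16(16));

    // Narrow the 8 x 16-bit results to bytes and store the low 8 bytes.
    _mm_storel_epi64(reinterpret_cast<__m128i *>(y_out), _mm_packus_epi16(y16, y16));
#else
    // Scalar fallback with the same arithmetic.
    for (int k = 0; k < 8; ++k) {
        const int r = rgb[3 * k], g = rgb[3 * k + 1], b = rgb[3 * k + 2];
        y_out[k] = static_cast<std::uint8_t>(((66 * r + 129 * g + 25 * b) >> 8) + 16);
    }
#endif
}
```

The U and V sums range from -28560 to 28560, so they fit in signed 16-bit lanes as well; they would need an arithmetic shift (`_mm_srai_epi16`) in place of the logical one.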
You can use SSE or 3DNow! assembly to boost the performance further.
As for the C++ code, I think it is hard to improve on what you currently have.