I wrote an algorithm to convert an RGB image to YUV420. I spent a long time trying to make it faster, but I haven't found any other way to boost its efficiency, so now I turn to
I suspect the lookup tables are superfluous: the corresponding multiplications should be faster than a memory access, especially in such an inner loop.
I would also make a few small changes (as others have already suggested):
    void Bitmap2Yuv420p( boost::uint8_t *destination, boost::uint8_t *rgb,
                         const int &width, const int &height ) {
        // Assumes packed 24-bit RGB input and even width and height.
        const size_t image_size = width * height;
        size_t upos = image_size;        // U plane starts right after the Y plane
        size_t vpos = upos + upos / 4;   // V plane starts right after the U plane
        for( size_t i = 0; i < image_size; ++i ) {
            boost::uint8_t r = rgb[3*i  ];
            boost::uint8_t g = rgb[3*i+1];
            boost::uint8_t b = rgb[3*i+2];
            // One luma sample per pixel
            destination[i] = ( ( 66*r + 129*g + 25*b ) >> 8 ) + 16;
            // One chroma sample per 2x2 block: even rows, even columns only
            if (!((i / width) % 2) && !(i % 2)) {
                destination[upos++] = ( ( -38*r + -74*g + 112*b ) >> 8) + 128;
                destination[vpos++] = ( ( 112*r + -94*g + -18*b ) >> 8) + 128;
            }
        }
    }
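The integer constants are not explained above. They match the usual BT.601 studio-range coefficients multiplied by 256 and rounded, which is why every sum is shifted right by 8. Here is a minimal sketch to check that; `Yuv`, `ToFixed8` and `RgbToYuvFixed` are hypothetical names I made up for the illustration, and it uses `std::uint8_t` instead of `boost::uint8_t` only to avoid the Boost dependency.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper type for this illustration only.
struct Yuv { int y, u, v; };

// Scale a floating-point coefficient to 8-bit fixed point (times 256, rounded).
inline int ToFixed8(double c) {
    return static_cast<int>(std::lround(c * 256.0));
}

// Per-pixel conversion using the same integer constants as the function above.
// Note: >> on a negative int is an arithmetic shift on mainstream compilers and
// is guaranteed since C++20; older standards leave it implementation-defined.
inline Yuv RgbToYuvFixed(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    return { ( (  66*r + 129*g +  25*b ) >> 8 ) +  16,
             ( ( -38*r -  74*g + 112*b ) >> 8 ) + 128,
             ( ( 112*r -  94*g -  18*b ) >> 8 ) + 128 };
}
```

For example, `ToFixed8(0.257)` gives 66 and `ToFixed8(-0.291)` gives -74, and a white pixel maps to Y=235, U=128, V=128, the top of the studio range.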
EDIT
You should also rearrange the code so that you can remove the if(). Small, simple inner loops without branches are fast. Here, it may be a good idea to write the Y plane first, then the U and V planes, like this:
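    void Bitmap2Yuv420p( boost::uint8_t *destination, boost::uint8_t *rgb,
                         const int &width, const int &height ) {
        // Assumes packed 24-bit RGB input and even width and height.
        const size_t image_size = width * height;
        boost::uint8_t *dst_y = destination;
        boost::uint8_t *dst_u = destination + image_size;
        boost::uint8_t *dst_v = destination + image_size + image_size/4;

        // Y plane
        for( size_t i = 0; i < image_size; ++i ) {
            *dst_y++ = ( ( 66*rgb[3*i] + 129*rgb[3*i+1] + 25*rgb[3*i+2] ) >> 8 ) + 16;
        }
    #if 1
        // U plane: one sample per 2x2 block (top-left pixel)
        for( size_t y=0; y<height; y+=2 ) {
            for( size_t x=0; x<width; x+=2 ) {
                const size_t i = y*width + x;
                *dst_u++ = ( ( -38*rgb[3*i] + -74*rgb[3*i+1] + 112*rgb[3*i+2] ) >> 8 ) + 128;
            }
        }
        // V plane
        for( size_t y=0; y<height; y+=2 ) {
            for( size_t x=0; x<width; x+=2 ) {
                const size_t i = y*width + x;
                *dst_v++ = ( ( 112*rgb[3*i] + -94*rgb[3*i+1] + -18*rgb[3*i+2] ) >> 8 ) + 128;
            }
        }
    #else // also try this version:
        // U+V planes
        for( size_t y=0; y<height; y+=2 ) {
            for( size_t x=0; x<width; x+=2 ) {
                const size_t i = y*width + x;
                *dst_u++ = ( ( -38*rgb[3*i] + -74*rgb[3*i+1] + 112*rgb[3*i+2] ) >> 8 ) + 128;
                *dst_v++ = ( ( 112*rgb[3*i] + -94*rgb[3*i+1] + -18*rgb[3*i+2] ) >> 8 ) + 128;
            }
        }
    #endif
    }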
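Both versions write into a single buffer laid out as the Y plane, then the U plane, then the V plane, with each chroma plane a quarter of the luma size. That only works out when width and height are even. A small sketch of how the caller might compute the buffer size and plane offsets; `Yuv420pLayout` and `MakeYuv420pLayout` are hypothetical helpers, not part of the code above.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helper describing the planar buffer used above.
struct Yuv420pLayout {
    std::size_t y_offset;    // start of the Y plane (always 0)
    std::size_t u_offset;    // start of the U plane
    std::size_t v_offset;    // start of the V plane
    std::size_t total_size;  // bytes needed for the whole destination buffer
};

inline Yuv420pLayout MakeYuv420pLayout(int width, int height) {
    // 2x2 chroma subsampling needs even, positive dimensions.
    if (width <= 0 || height <= 0 || width % 2 != 0 || height % 2 != 0)
        throw std::invalid_argument("width and height must be positive and even");
    const std::size_t luma =
        static_cast<std::size_t>(width) * static_cast<std::size_t>(height);
    const std::size_t chroma = luma / 4;  // one U (or V) sample per 2x2 block
    return { 0, luma, luma + chroma, luma + 2 * chroma };
}
```

The destination passed to Bitmap2Yuv420p would then need at least `total_size` bytes, i.e. width * height * 3 / 2, while the RGB source holds width * height * 3 bytes.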