rgb to yuv420 algorithm efficiency

I wrote an algorithm to convert an RGB image to YUV420. I spent a long time trying to make it faster, but I haven't found any other way to boost its efficiency, so now I turn to

5 Answers

小蘑菇

2021-01-30 10:17
The only obvious point I can see is that you're computing 3 * i three times. You could store that result in a local variable, but the compiler may well already be doing that. So:
```
r = rgb + 3 * i;
g = rgb + 3 * i + 1;
b = rgb + 3 * i + 2;
```
...becomes:
```
r = rgb + 3 * i;
g = r + 1;
b = g + 1;
```
...although I doubt it would have much impact.

As ciphor suggests, I think assembly is the only way you're likely to improve upon what you've got there.
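
To make the pointer-stepping idea concrete, here is a minimal sketch that advances a single pointer by 3 bytes per pixel instead of recomputing `3 * i`. The function name `rgb_to_luma` is a hypothetical one of mine, and it only fills the Y plane using the coefficients that appear elsewhere in this thread. A modern compiler may well generate the same code for both forms.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the question): fills the Y plane by stepping
// one pointer through packed 24-bit RGB data, 3 bytes per pixel.
void rgb_to_luma(const std::uint8_t* rgb, std::uint8_t* y, std::size_t pixel_count)
{
    const std::uint8_t* p = rgb;
    for (std::size_t i = 0; i < pixel_count; ++i, p += 3) {
        // p[0] = R, p[1] = G, p[2] = B of the current pixel
        y[i] = static_cast<std::uint8_t>(((66 * p[0] + 129 * p[1] + 25 * p[2]) >> 8) + 16);
    }
}
```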

天命终不由人

2021-01-30 10:18

Unroll your loop and get rid of the if in the inner loop. And do not run over the image data three times; a single pass is even faster!

```
// Single pass over packed RGB input; writes planar YUV420 (Y, then U, then V).
// Assumes width and height are both even.
void Bitmap2Yuv420p_calc2(uint8_t *destination, uint8_t *rgb, size_t width, size_t height)
{
    size_t image_size = width * height;
    size_t upos = image_size;             // start of the U plane
    size_t vpos = upos + upos / 4;        // start of the V plane
    size_t i = 0;                         // index of the current pixel

    for( size_t line = 0; line < height; ++line )
    {
        if( !(line % 2) )
        {
            // Even line: two pixels per iteration, chroma taken from the first one.
            for( size_t x = 0; x < width; x += 2 )
            {
                uint8_t r = rgb[3 * i];
                uint8_t g = rgb[3 * i + 1];
                uint8_t b = rgb[3 * i + 2];

                destination[i++] = ((66*r + 129*g + 25*b) >> 8) + 16;

                destination[upos++] = ((-38*r + -74*g + 112*b) >> 8) + 128;
                destination[vpos++] = ((112*r + -94*g + -18*b) >> 8) + 128;

                r = rgb[3 * i];
                g = rgb[3 * i + 1];
                b = rgb[3 * i + 2];

                destination[i++] = ((66*r + 129*g + 25*b) >> 8) + 16;
            }
        }
        else
        {
            // Odd line: luma only.
            for( size_t x = 0; x < width; x += 1 )
            {
                uint8_t r = rgb[3 * i];
                uint8_t g = rgb[3 * i + 1];
                uint8_t b = rgb[3 * i + 2];

                destination[i++] = ((66*r + 129*g + 25*b) >> 8) + 16;
            }
        }
    }
}
```

In my tests, this was about 25% faster than your accepted answer (VS 2010, depending on whether x86 or x64 is enabled.)
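
For anyone sizing the destination buffer: the `upos` and `vpos` offsets in the function above imply a Y plane of `width * height` bytes followed by U and V planes of a quarter of that size each. A minimal sketch of that layout, where `Yuv420pLayout` and `yuv420p_layout` are hypothetical names of mine and both dimensions are assumed to be even:

```cpp
#include <cstddef>

// Hypothetical helper type: where each plane starts in a planar YUV420 buffer.
struct Yuv420pLayout {
    std::size_t y_offset;   // start of the full-resolution Y plane
    std::size_t u_offset;   // start of the quarter-size U plane
    std::size_t v_offset;   // start of the quarter-size V plane
    std::size_t total_size; // bytes needed for the whole destination buffer
};

// Assumes width and height are both even, as the conversion loops do.
Yuv420pLayout yuv420p_layout(std::size_t width, std::size_t height)
{
    const std::size_t image_size = width * height;
    const std::size_t chroma_size = image_size / 4;
    return { 0, image_size, image_size + chroma_size, image_size + 2 * chroma_size };
}
```

So the destination needs `width * height * 3 / 2` bytes in total, and the RGB source needs `width * height * 3`.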

庸人自扰

2021-01-30 10:22

I guess the lookup tables are superfluous. The respective multiplications should be faster than a memory access, especially in such an inner loop.

Then, I would also apply some small changes (as others have already suggested):

```
void Bitmap2Yuv420p( boost::uint8_t *destination, boost::uint8_t *rgb,
                     const int &width, const int &height ) {
  const size_t image_size = width * height;
  size_t upos = image_size;
  size_t vpos = upos + upos / 4;
  for( size_t i = 0; i < image_size; ++i ) {
    boost::uint8_t r = rgb[3*i  ];
    boost::uint8_t g = rgb[3*i+1];
    boost::uint8_t b = rgb[3*i+2];
    destination[i] = ( ( 66*r + 129*g + 25*b ) >> 8 ) + 16;
    if (!((i / width) % 2) && !(i % 2)) {
      destination[upos++] = ( ( -38*r + -74*g + 112*b ) >> 8) + 128;
      destination[vpos++] = ( ( 112*r + -94*g + -18*b ) >> 8) + 128;
    }
  }
}
```
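
As a sanity check on the fixed-point coefficients used above, here is a small per-pixel sketch. The names `Yuv` and `rgb_to_yuv` are hypothetical ones of mine. With these formulas, white (255, 255, 255) maps to Y = 235, U = V = 128 and black maps to Y = 16, U = V = 128. Note that the chroma terms can be negative before the shift, so this relies on `>>` being an arithmetic shift for signed values, which mainstream compilers do and C++20 guarantees.

```cpp
#include <cstdint>

// Hypothetical helper type holding one converted pixel.
struct Yuv {
    std::uint8_t y, u, v;
};

// Converts one pixel with plain integer multiplications, no lookup tables.
// The uint8_t operands are promoted to int, so the sums cannot overflow.
Yuv rgb_to_yuv(std::uint8_t r, std::uint8_t g, std::uint8_t b)
{
    Yuv out;
    out.y = static_cast<std::uint8_t>(((66 * r + 129 * g + 25 * b) >> 8) + 16);
    // Chroma sums may be negative; >> is assumed to be an arithmetic shift.
    out.u = static_cast<std::uint8_t>(((-38 * r - 74 * g + 112 * b) >> 8) + 128);
    out.v = static_cast<std::uint8_t>(((112 * r - 94 * g - 18 * b) >> 8) + 128);
    return out;
}
```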

EDIT

You should also rearrange the code so that you can remove the if(). Small, simple inner loops without branches are fast. Here, it may be a good idea to write the Y plane first, then the U and V planes, like this:

```
void Bitmap2Yuv420p( boost::uint8_t *destination, boost::uint8_t *rgb,
                     const int &width, const int &height ) {
  const size_t image_size = width * height;
  boost::uint8_t *dst_y = destination;
  boost::uint8_t *dst_u = destination + image_size;
  boost::uint8_t *dst_v = destination + image_size + image_size/4;

  // Y plane
  for( size_t i = 0; i < image_size; ++i ) {
    *dst_y++ = ( ( 66*rgb[3*i] + 129*rgb[3*i+1] + 25*rgb[3*i+2] ) >> 8 ) + 16;
  }
#if 1
  // U plane
  for( size_t y=0; y<height; y+=2 ) {
    for( size_t x=0; x<width; x+=2 ) {
      const size_t i = y*width + x;
      *dst_u++ = ( ( -38*rgb[3*i] + -74*rgb[3*i+1] + 112*rgb[3*i+2] ) >> 8 ) + 128;
    }
  }
  // V plane
  for( size_t y=0; y<height; y+=2 ) {
    for( size_t x=0; x<width; x+=2 ) {
      const size_t i = y*width + x;
      *dst_v++ = ( ( 112*rgb[3*i] + -94*rgb[3*i+1] + -18*rgb[3*i+2] ) >> 8 ) + 128;
    }
  }
#else // also try this version:
  // U+V planes
  for( size_t y=0; y<height; y+=2 ) {
    for( size_t x=0; x<width; x+=2 ) {
      const size_t i = y*width + x;
      *dst_u++ = ( ( -38*rgb[3*i] + -74*rgb[3*i+1] + 112*rgb[3*i+2] ) >> 8 ) + 128;
      *dst_v++ = ( ( 112*rgb[3*i] + -94*rgb[3*i+1] + -18*rgb[3*i+2] ) >> 8 ) + 128;
    }
  }
#endif
}
```

遇见更好的自我

2021-01-30 10:31
Do not dereference a pointer more than once. Copy the value into a local variable and use that instead; because of possible aliasing, the compiler may otherwise have to reload it from memory each time.
```
...
int v_r = *r;
int v_g = *g;
int v_b = *b;

*y = ((lookup66[v_r] + lookup129[v_g] + lookup25[v_b]) >> 8) + 16;
...
```
On the other hand, you could do it with SSE without look-up tables and process 8 pixels at once.
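
A rough sketch of what the SSE route could look like for the luma channel only, using SSE2 intrinsics. This is not a drop-in replacement: `luma8_sse2` is a hypothetical name of mine, it assumes an x86 target, and it assumes the R, G and B bytes are already split into separate arrays (packed RGB would first need a deinterleaving step that is not shown). The chroma channels have negative coefficients and would need signed handling, which is also left out.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: computes 8 luma samples from 8 R, 8 G and 8 B bytes
// stored in separate arrays, with no lookup tables.
void luma8_sse2(const std::uint8_t* r, const std::uint8_t* g,
                const std::uint8_t* b, std::uint8_t* y)
{
    const __m128i zero = _mm_setzero_si128();

    // Load 8 bytes per channel and widen them to eight 16-bit lanes.
    const __m128i r16 = _mm_unpacklo_epi8(
        _mm_loadl_epi64(reinterpret_cast<const __m128i*>(r)), zero);
    const __m128i g16 = _mm_unpacklo_epi8(
        _mm_loadl_epi64(reinterpret_cast<const __m128i*>(g)), zero);
    const __m128i b16 = _mm_unpacklo_epi8(
        _mm_loadl_epi64(reinterpret_cast<const __m128i*>(b)), zero);

    // 66*r + 129*g + 25*b is at most 56100, so it fits in an unsigned
    // 16-bit lane without wrapping.
    __m128i sum = _mm_mullo_epi16(r16, _mm_set1_epi16(66));
    sum = _mm_add_epi16(sum, _mm_mullo_epi16(g16, _mm_set1_epi16(129)));
    sum = _mm_add_epi16(sum, _mm_mullo_epi16(b16, _mm_set1_epi16(25)));

    // Logical shift (the lanes are treated as unsigned), then add the offset.
    const __m128i y16 = _mm_add_epi16(_mm_srli_epi16(sum, 8), _mm_set1_epi16(16));

    // Narrow back to bytes and store the low 8 of them.
    _mm_storel_epi64(reinterpret_cast<__m128i*>(y), _mm_packus_epi16(y16, zero));
}
```

Whether this actually beats the scalar loop depends on the compiler and on how cheaply the input can be deinterleaved, so measure before committing to it.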

囚心锁ツ

2021-01-30 10:31

You can use SSE or 3DNow! assembly code to further boost the performance.

As for the C++ code, I think it is hard to improve on what you currently have.
