RGB to YUV420 algorithm efficiency

北荒 2021-01-30 10:05

I wrote an algorithm to convert an RGB image to YUV420. I spent a long time trying to make it faster, but I haven't found any other way to boost its efficiency, so now I turn to

5 answers
  • 2021-01-30 10:17

    The only obvious point I can see is that you're computing 3 * i three times. You could store that result in a local variable, but the compiler may well already be doing that. So:

    r = rgb + 3 * i;
    g = rgb + 3 * i + 1;
    b = rgb + 3 * i + 2;
    

    ...becomes:

    r = rgb + 3 * i;
    g = r + 1;
    b = g + 1;
    

    ...although I doubt it would have much impact.

    As ciphor suggests, I think assembly is the only way you're likely to improve upon what you've got there.

  • 2021-01-30 10:18

    Unroll your loop and get rid of the if in the inner loop. But do not run over the image data three times; a single pass is even faster!

    void Bitmap2Yuv420p_calc2(uint8_t *destination, uint8_t *rgb, size_t width, size_t height)
    {
        size_t image_size = width * height;
        size_t upos = image_size;
        size_t vpos = upos + upos / 4;
        size_t i = 0;
    
        for( size_t line = 0; line < height; ++line )
        {
            if( !(line % 2) )
            {
                for( size_t x = 0; x < width; x += 2 )
                {
                    uint8_t r = rgb[3 * i];
                    uint8_t g = rgb[3 * i + 1];
                    uint8_t b = rgb[3 * i + 2];
    
                    destination[i++] = ((66*r + 129*g + 25*b) >> 8) + 16;
    
                    destination[upos++] = ((-38*r + -74*g + 112*b) >> 8) + 128;
                    destination[vpos++] = ((112*r + -94*g + -18*b) >> 8) + 128;
    
                    r = rgb[3 * i];
                    g = rgb[3 * i + 1];
                    b = rgb[3 * i + 2];
    
                    destination[i++] = ((66*r + 129*g + 25*b) >> 8) + 16;
                }
            }
            else
            {
                for( size_t x = 0; x < width; x += 1 )
                {
                    uint8_t r = rgb[3 * i];
                    uint8_t g = rgb[3 * i + 1];
                    uint8_t b = rgb[3 * i + 2];
    
                    destination[i++] = ((66*r + 129*g + 25*b) >> 8) + 16;
                }
            }
        }
    }
    

    In my tests, this was about 25% faster than the accepted answer (VS 2010; the exact figure depends on whether x86 or x64 is targeted).
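One possible further step, which is not part of the answer itself, is to handle two rows per iteration so that the per-line if disappears as well. The following is only a sketch: it assumes an even width and height, the names Bitmap2Yuv420p_rows2, rgb_to_y, rgb_to_u and rgb_to_v are made up for illustration, and whether it is actually faster would have to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helpers using the same integer coefficients as the function above.
// Note: >> on a negative int is an arithmetic shift on common compilers
// (and guaranteed since C++20); the original code relies on this too.
static inline uint8_t rgb_to_y(int r, int g, int b) {
    return static_cast<uint8_t>(((66 * r + 129 * g + 25 * b) >> 8) + 16);
}
static inline uint8_t rgb_to_u(int r, int g, int b) {
    return static_cast<uint8_t>(((-38 * r - 74 * g + 112 * b) >> 8) + 128);
}
static inline uint8_t rgb_to_v(int r, int g, int b) {
    return static_cast<uint8_t>(((112 * r - 94 * g - 18 * b) >> 8) + 128);
}

// Sketch: converts packed RGB24 to planar YUV420, one 2x2 block at a time.
// Assumption: width and height are both even.
void Bitmap2Yuv420p_rows2(uint8_t *destination, const uint8_t *rgb,
                          size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);

    const size_t image_size = width * height;
    uint8_t *y_plane = destination;
    uint8_t *u_plane = destination + image_size;
    uint8_t *v_plane = u_plane + image_size / 4;

    for (size_t line = 0; line < height; line += 2) {
        const uint8_t *top = rgb + 3 * line * width;  // even row
        const uint8_t *bottom = top + 3 * width;      // odd row below it
        uint8_t *y_top = y_plane + line * width;
        uint8_t *y_bottom = y_top + width;

        for (size_t x = 0; x < width; x += 2) {
            // Top-left pixel supplies the chroma sample for the 2x2 block,
            // as in the function above.
            const int r = top[0], g = top[1], b = top[2];
            *y_top++ = rgb_to_y(r, g, b);
            *u_plane++ = rgb_to_u(r, g, b);
            *v_plane++ = rgb_to_v(r, g, b);

            // The remaining three pixels only contribute luma.
            *y_top++ = rgb_to_y(top[3], top[4], top[5]);
            *y_bottom++ = rgb_to_y(bottom[0], bottom[1], bottom[2]);
            *y_bottom++ = rgb_to_y(bottom[3], bottom[4], bottom[5]);

            top += 6;
            bottom += 6;
        }
    }
}
```

The output layout is the same as before (Y plane, then U, then V), so the result should be byte-for-byte identical to the function above for images with even dimensions.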

  • 2021-01-30 10:22

    I suspect the lookup tables are superfluous: the multiplications should be faster than a memory access, especially in an inner loop like this.

    I would also apply a few small changes (as others have already suggested):

    void Bitmap2Yuv420p( boost::uint8_t *destination, boost::uint8_t *rgb,
                         const int &width, const int &height ) {
      const size_t image_size = width * height;
      const size_t upos = image_size;
      const size_t vpos = upos + upos / 4;
      for( size_t i = 0; i < image_size; ++i ) {
        boost::uint8_t r = rgb[3*i  ];
        boost::uint8_t g = rgb[3*i+1];
        boost::uint8_t b = rgb[3*i+2];
        destination[i] = ( ( 66*r + 129*g + 25*b ) >> 8 ) + 16;
        if (!((i / width) % 2) && !(i % 2)) {
          destination[upos++] = ( ( -38*r + -74*g + 112*b ) >> 8) + 128;
          destination[vpos++] = ( ( 112*r + -94*g + -18*b ) >> 8) + 128;
        }
      }
    }
    

    EDIT

    You should also rearrange the code, so that you can remove the if(). Small, simple inner loops without branches are fast. Here, it may be a good idea to first write Y plane, then U and V planes, like this:

    void Bitmap2Yuv420p( boost::uint8_t *destination, boost::uint8_t *rgb,
                         const int &width, const int &height ) {
      const size_t image_size = width * height;
      boost::uint8_t *dst_y = destination;
      boost::uint8_t *dst_u = destination + image_size;
      boost::uint8_t *dst_v = destination + image_size + image_size/4;
    
      // Y plane
      for( size_t i = 0; i < image_size; ++i ) {
        *dst_y++ = ( ( 66*rgb[3*i] + 129*rgb[3*i+1] + 25*rgb[3*i+2] ) >> 8 ) + 16;
      }
    #if 1
      // U plane
      for( size_t y=0; y<height; y+=2 ) {
        for( size_t x=0; x<width; x+=2 ) {
          const size_t i = y*width + x;
          *dst_u++ = ( ( -38*rgb[3*i] + -74*rgb[3*i+1] + 112*rgb[3*i+2] ) >> 8 ) + 128;
        }
      }
      // V plane
      for( size_t y=0; y<height; y+=2 ) {
        for( size_t x=0; x<width; x+=2 ) {
          const size_t i = y*width + x;
          *dst_v++ = ( ( 112*rgb[3*i] + -94*rgb[3*i+1] + -18*rgb[3*i+2] ) >> 8 ) + 128;
        }
      }
    #else // also try this version:
      // U+V planes
      for( size_t y=0; y<height; y+=2 ) {
        for( size_t x=0; x<width; x+=2 ) {
          const size_t i = y*width + x;
          *dst_u++ = ( ( -38*rgb[3*i] + -74*rgb[3*i+1] + 112*rgb[3*i+2] ) >> 8 ) + 128;
          *dst_v++ = ( ( 112*rgb[3*i] + -94*rgb[3*i+1] + -18*rgb[3*i+2] ) >> 8 ) + 128;
        }
      }
    #endif
    }
    
  • 2021-01-30 10:31

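    Do not dereference a pointer more than once; copy the value into a local variable and use that instead. Because of possible aliasing, the compiler may otherwise have to reload the value from memory after every store.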

    ...
    int v_r = *r;
    int v_g = *g;
    int v_b = *b;
    
    *y = ((lookup66[v_r] + lookup129[v_g] + lookup25[v_b]) >> 8) + 16;
    ...
    

    Alternatively, you can do it with SSE, without lookup tables, and process 8 pixels at once.
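As a rough illustration of the SSE idea, here is a sketch of the luma (Y) computation with SSE2 intrinsics, eight pixels per iteration in 16-bit lanes. It is not the answerer's code: the name rgb24_to_luma is hypothetical, the input is assumed to be packed RGB24, and the deinterleaving is done with plain scalar loads for simplicity (a tuned version would more likely use a byte shuffle such as SSSE3 _mm_shuffle_epi8). On targets without SSE2 the sketch falls back to the scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define LUMA_SKETCH_HAS_SSE2 1
#endif

// Scalar reference: same formula as used throughout the thread.
static inline uint8_t luma_scalar(int r, int g, int b) {
    return static_cast<uint8_t>(((66 * r + 129 * g + 25 * b) >> 8) + 16);
}

// Sketch (hypothetical name): writes pixel_count luma bytes from packed RGB24.
void rgb24_to_luma(uint8_t *dst_y, const uint8_t *rgb, size_t pixel_count)
{
    assert(dst_y != nullptr && rgb != nullptr);
    size_t i = 0;

#ifdef LUMA_SKETCH_HAS_SSE2
    const __m128i c66 = _mm_set1_epi16(66);
    const __m128i c129 = _mm_set1_epi16(129);
    const __m128i c25 = _mm_set1_epi16(25);
    const __m128i c16 = _mm_set1_epi16(16);

    for (; i + 8 <= pixel_count; i += 8) {
        const uint8_t *p = rgb + 3 * i;
        // Gather 8 pixels into 16-bit lanes (arguments run from lane 7 down to lane 0).
        const __m128i r = _mm_set_epi16(p[21], p[18], p[15], p[12], p[9], p[6], p[3], p[0]);
        const __m128i g = _mm_set_epi16(p[22], p[19], p[16], p[13], p[10], p[7], p[4], p[1]);
        const __m128i b = _mm_set_epi16(p[23], p[20], p[17], p[14], p[11], p[8], p[5], p[2]);

        // 66*r + 129*g + 25*b is at most 220*255 = 56100, which fits in an
        // unsigned 16-bit lane, so the low 16 bits of each product suffice.
        const __m128i sum = _mm_add_epi16(
            _mm_mullo_epi16(r, c66),
            _mm_add_epi16(_mm_mullo_epi16(g, c129), _mm_mullo_epi16(b, c25)));

        // Logical shift, because the sum is treated as unsigned.
        const __m128i y = _mm_add_epi16(_mm_srli_epi16(sum, 8), c16);

        // Narrow to 8 bits and store the low 8 bytes (unaligned store is fine here).
        _mm_storel_epi64(reinterpret_cast<__m128i *>(dst_y + i), _mm_packus_epi16(y, y));
    }
#endif

    // Scalar tail (or the whole image when SSE2 is unavailable).
    for (; i < pixel_count; ++i) {
        dst_y[i] = luma_scalar(rgb[3 * i], rgb[3 * i + 1], rgb[3 * i + 2]);
    }
}
```

The U and V planes would need signed arithmetic (the sums can be negative), so they are left out of this sketch.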

  • 2021-01-30 10:31

    You can use SSE or 3DNow! assembly code to further boost performance.

    As for the C++ code, I think it is hard to improve on what you currently have.
