LockBits Performance Critical Code

I have a method which needs to be as fast as it possibly can, it uses unsafe memory pointers and its my first foray into this type of coding so I know it can probably be faster.

    /// <summary>
    /// Copies bitmapdata from one bitmap to another at a specified point on the output bitmapdata
    /// </summary>
    /// <param name="sourcebtmpdata">The sourcebitmap must be smaller that the destbitmap</param>
    /// <param name="destbtmpdata"></param>
    /// <param name="point">The point on the destination bitmap to draw at</param>
    private static unsafe void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
    {
        // calculate total number of rows to draw.
        var totalRow = Math.Min(
            destbtmpdata.Height - point.Y,
            sourcebtmpdata.Height);


        //loop through each row on the source bitmap and get mem pointers
        //to the source bitmap and dest bitmap
        for (int i = 0; i < totalRow; i++)
        {
            int destRow = point.Y + i;

            //get the pointer to the start of the current pixel "row" on the output image
            byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride);
            //get the pointer to the start of the FIRST pixel row on the source image
            byte* srcRowPtr = (byte*)sourcebtmpdata.Scan0 + (i * sourcebtmpdata.Stride);

            int pointX = point.X;
            //the rowSize is pre-computed before the loop to improve performance
            int rowSize = Math.Min(destbtmpdata.Width - pointX, sourcebtmpdata.Width);
            //for each row each set each pixel
            for (int j = 0; j < rowSize; j++)
            {
                int firstBlueByte = ((pointX + j)*3);

                int srcByte = j *3;
                destRowPtr[(firstBlueByte)] = srcRowPtr[srcByte];
                destRowPtr[(firstBlueByte) + 1] = srcRowPtr[srcByte + 1];
                destRowPtr[(firstBlueByte) + 2] = srcRowPtr[srcByte + 2];
            }


        }
    }

So is there anything that can be done to make this faster? Ignore the todo for now, ill fix that later once I have some baseline performance measurements.

UPDATE: Sorry, should have mentioned that the reason i'm using this instead of Graphics.DrawImage is because im implementing multi-threading and because of that I cant use DrawImage.

UPDATE 2: I'm still not satisfied with the performance and i'm sure there's a few more ms that can be had.

Lee Treveil

There was something fundamentally wrong with the code that I cant believe I didn't notice until now.

byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride);

This gets a pointer to the destination row but it does not get the column that it is copying to, that in the old code is done inside the rowSize loop. It now looks like:

byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride) + pointX * 3;

So now we have the correct pointer for the destination data. Now we can get rid of that for loop. Using suggestions from Vilx- and Rob the code now looks like:

        private static unsafe void CopyBitmapToDestSuperFast(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
    {
        //calculate total number of rows to copy.
        //using ternary operator instead of Math.Min, few ms faster
        int totalRows = (destbtmpdata.Height - point.Y < sourcebtmpdata.Height) ? destbtmpdata.Height - point.Y : sourcebtmpdata.Height;
        //calculate the width of the image to draw, this cuts off the image
        //if it goes past the width of the destination image
        int rowWidth = (destbtmpdata.Width - point.X < sourcebtmpdata.Width) ? destbtmpdata.Width - point.X : sourcebtmpdata.Width;

        //loop through each row on the source bitmap and get mem pointers
        //to the source bitmap and dest bitmap
        for (int i = 0; i < totalRows; i++)
        {
            int destRow = point.Y + i;

            //get the pointer to the start of the current pixel "row" and column on the output image
            byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride) + point.X * 3;

            //get the pointer to the start of the FIRST pixel row on the source image
            byte* srcRowPtr = (byte*)sourcebtmpdata.Scan0 + (i * sourcebtmpdata.Stride);

            //RtlMoveMemory function
            CopyMemory(new IntPtr(destRowPtr), new IntPtr(srcRowPtr), (uint)rowWidth * 3);

        }
    }

Copying a 500x500 image to a 5000x5000 image in a grid 50 times took: 00:00:07.9948993 secs. Now with the changes above it takes 00:00:01.8714263 secs. Much better.

Well... I'm not sure whether .NET bitmap data formats are entirely compatible with Windows's GDI32 functions...

But one of the first few Win32 API I learned was BitBlt:

BOOL BitBlt(
  HDC hdcDest, 
  int nXDest, 
  int nYDest, 
  int nWidth, 
  int nHeight, 
  HDC hdcSrc, 
  int nXSrc, 
  int nYSrc, 
  DWORD dwRop
);

And it was the fastest way to copy data around, if I remember correctly.

Here's the BitBlt PInvoke signature for use in C# and related usage information, a great read for any one working with high-performance graphics in C#:

http://www.pinvoke.net/default.aspx/gdi32/BitBlt.html

Definitely worth a look.

The inner loop is where you want to concentrate a lot of your time (but, do measurements to make sure)

for  (int j = 0; j < sourcebtmpdata.Width; j++)
{
    destRowPtr[(point.X + j) * 3] = srcRowPtr[j * 3];
    destRowPtr[((point.X + j) * 3) + 1] = srcRowPtr[(j * 3) + 1];
    destRowPtr[((point.X + j) * 3) + 2] = srcRowPtr[(j * 3) + 2];
}

Get rid of the multiplies and the array indexing (which is a multiply under the hoods) and replace with a pointer that you are incrementing.
Ditto with the +1, +2, increment a pointer.
Probably your compiler won't keep computing point.X (check), but make a local variable just in case. It won't do it on the single iteration, but it might each iteration.

You may want to look at Eigen.

It is a C++ template library that uses SSE (2 and later) and AltiVec instruction sets with graceful fallback to non-vectorized code.

Fast. (See benchmark).
Expression templates allow to intelligently remove temporaries and enable lazy evaluation, when that is appropriate -- Eigen takes care of this automatically and handles aliasing too in most cases.
Explicit vectorization is performed for the SSE (2 and later) and AltiVec instruction sets, with graceful fallback to non-vectorized code. Expression templates allow to perform these optimizations globally for whole expressions.
With fixed-size objects, dynamic memory allocation is avoided, and the loops are unrolled when that makes sense.
For large matrices, special attention is paid to cache-friendliness.

You could implement you function in C++ and then call that from C#

You don't always need to use pointers to get good speed. This should be within a couple ms of the original:

        private static void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
    {
        byte[] src = new byte[sourcebtmpdata.Height * sourcebtmpdata.Width * 3];
        int maximum = src.Length;
        byte[] dest = new byte[maximum];
        Marshal.Copy(sourcebtmpdata.Scan0, src, 0, src.Length);
        int pointX = point.X * 3;
        int copyLength = destbtmpdata.Width*3 - pointX;
        int k = pointX + point.Y * sourcebtmpdata.Stride;
        int rowWidth = sourcebtmpdata.Stride;
        while (k<maximum)
        {
            Array.Copy(src,k,dest,k,copyLength);
            k += rowWidth;

        }
        Marshal.Copy(dest, 0, destbtmpdata.Scan0, dest.Length);
    }

Unfortunately I don't have the time to write a full solution, but I would look into using the platform RtlMoveMemory() function to move rows as a whole, not byte-by-byte. That should be a lot faster.

I think the stride size and row number limits can be calculated in advance.

And I precalculated all multiplications, resulting in the following code:

private static unsafe void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
{
    //TODO: It is expected that the bitmap PixelFormat is Format24bppRgb but this could change in the future
    const int pixelSize = 3;

    // calculate total number of rows to draw.
    var totalRow = Math.Min(
        destbtmpdata.Height - point.Y,
        sourcebtmpdata.Height);

    var rowSize = Math.Min(
        (destbtmpdata.Width - point.X) * pixelSize,
        sourcebtmpdata.Width * pixelSize);

    // starting point of copy operation
    byte* srcPtr = (byte*)sourcebtmpdata.Scan0;
    byte* destPtr = (byte*)destbtmpdata.Scan0 + point.Y * destbtmpdata.Stride;

    // loop through each row
    for (int i = 0; i < totalRow; i++) {

        // draw the entire row
        for (int j = 0; j < rowSize; j++)
            destPtr[point.X + j] = srcPtr[j];

        // advance each pointer by 1 row
        destPtr += destbtmpdata.Stride;
        srcPtr += sourcebtmpdata.Stride;
    }

}

Havn't tested it thoroughly but you should be able to get that to work.

I have removed multiplication operations from the loop (pre-calculated instead) and removed most branchings so it should be somewhat faster.

Let me know if this helps :-)

I am looking at your C# code and I can't recognize anything familiar. It all looks like a ton of C++. BTW, it looks like DirectX/XNA needs to become your new friend. Just my 2 cents. Don't kill the messenger.

If you must rely on CPU to do this: I've done some 24-bit layout optimizations myself, and I can tell you that memory access speed should be your bottleneck. Use SSE3 instructions for fastest possible bytewise access. This means C++ and embedded assembly language. In pure C you'll be 30% slower on most machines.

Keep in mind that modern GPUs are MUCH faster than CPU in this sort of operations.

I am not sure if this will give extra performance, but I see the pattern a lot in Reflector.

So:

int srcByte = j *3;
destRowPtr[(firstBlueByte)] = srcRowPtr[srcByte];
destRowPtr[(firstBlueByte) + 1] = srcRowPtr[srcByte + 1];
destRowPtr[(firstBlueByte) + 2] = srcRowPtr[srcByte + 2];

Becomes:

*destRowPtr++ = *srcRowPtr++;
*destRowPtr++ = *srcRowPtr++;
*destRowPtr++ = *srcRowPtr++;

Probably needs more braces.

If the width is fixed, you could probably unroll the entire line into a few hundred lines. :)

Update

You could also try using a bigger type, eg Int32 or Int64 for better performance.

Alright, this is going to be fairly close to the line of how many ms you can get out of the algorithm, but get rid of the call to Math.Min and replace it with a trinary operator instead.

Generally, making a library call will take longer than doing something on your own and I made a simple test driver to confirm this for Math.Min.

using System;
using System.Diagnostics;

namespace TestDriver
{
    class Program
    {
        static void Main(string[] args)
        {
            // Start the stopwatch
            if (Stopwatch.IsHighResolution)
            { Console.WriteLine("Using high resolution timer"); }
            else
            { Console.WriteLine("High resolution timer unavailable"); }
            // Test Math.Min for 10000 iterations
            Stopwatch sw = Stopwatch.StartNew();
            for (int ndx = 0; ndx < 10000; ndx++)
            {
                int result = Math.Min(ndx, 5000);
            }
            Console.WriteLine(sw.Elapsed.TotalMilliseconds.ToString("0.0000"));
            // Test trinary operator for 10000 iterations
            sw = Stopwatch.StartNew();
            for (int ndx = 0; ndx < 10000; ndx++)
            {
                int result = (ndx < 5000) ? ndx : 5000;
            }
            Console.WriteLine(sw.Elapsed.TotalMilliseconds.ToString("0.0000"));
            Console.ReadKey();
        }
    }
}

The results when running the above on my computer, an Intel T2400 @1.83GHz. Also, note that there is a bit of variation in the results, but generally the trinay operator is faster by about 0.01 ms. That's not much, but over a big enough dataset it will add up.

Using high resolution timer
0.0539
0.0402

来源：https://stackoverflow.com/questions/740555/lockbits-performance-critical-code

标签

performance

image-processing

image-manipulation

unsafe