Question
I'm doing research on 2D bin packing algorithms. I previously asked a similar question about a PHP implementation's performance (it was too slow to pack anything), and the code has since been converted to C++.
It's still pretty slow. What my program does is repeatedly allocate blocks of dynamic memory and populate them with the character 'o':
char* bin;
bin = new (nothrow) char[area];
if (bin == 0) {
    cout << "Error: " << area << " bytes could not be allocated";
    return false;
}
for (int i=0; i<area; i++) {
    bin[i]='o';
}
(Their size is between 1 KB and 30 KB for my datasets.)
Then the program checks different combinations of 'x' characters inside the current memory block:
void place(char* bin, int* best, int width)
{
    for (int i=best[0]; i<best[0]+best[1]; i++)
        for (int j=best[2]; j<best[2]+best[3]; j++)
            bin[i*width+j] = 'x';
}
One of the functions, which checks that placements do not overlap, gets called millions of times during a run:
bool fits(char* bin, int* pos, int width)
{
    for (int i=pos[0]; i<pos[0]+pos[1]; i++)
        for (int j=pos[2]; j<pos[2]+pos[3]; j++)
            if (bin[i*width+j] == 'x')
                return false;
    return true;
}
All the other work takes only a percent of the runtime, so I need to make these two functions (fits and place) faster. Who's the culprit?
Since I only have two values, 'x' and 'o', I could try to use just one bit instead of the whole byte the char takes. But I'm more concerned with speed; do you think that would make things faster?
Thanks!
Update: I replaced int* pos with rect pos (the same for best), as MSalters suggested. At first I saw an improvement, but I tested more with bigger datasets and it seems to be back to normal runtimes. I'll try the other techniques suggested and will keep you posted.
Update: Using memset and memchr sped things up by about a factor of two. Replacing 'x' and 'o' with '\1' and '\0' didn't show any improvement. __restrict wasn't helpful either. Overall, I'm satisfied with the performance of the program now, since I also made some improvements to the algorithm itself. I have yet to try using a bitmap and compiling with -O2 (-O3)... Thanks again, everybody.
Answer 1:
The best possibility would be to use an algorithm with better complexity.
But even your current algorithm could be sped up. Try using SSE instructions to test ~16 bytes at once. You can also make a single large allocation and split it up yourself; that will be faster than using the library allocator (the library allocator has the advantage of letting you free blocks individually, but I don't think you need that feature).
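For illustration, here is a minimal sketch of what an SSE2-based check of 16 bytes at a time could look like (this is my own example, not code from the question; the name row_is_free_sse and the scalar tail loop are assumptions):

#include <emmintrin.h>   // SSE2 intrinsics

// Returns true if none of the first `len` bytes of `row` equals 'x'.
bool row_is_free_sse(const char* row, int len)
{
    const __m128i xs = _mm_set1_epi8('x');
    int i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i*)(row + i)); // unaligned 16-byte load
        __m128i eq    = _mm_cmpeq_epi8(chunk, xs);                  // 0xFF where byte == 'x'
        if (_mm_movemask_epi8(eq) != 0)                             // any match in this chunk?
            return false;
    }
    for (; i < len; ++i)             // scalar tail for the last few bytes
        if (row[i] == 'x')
            return false;
    return true;
}

fits could then call something like row_is_free_sse(bin + i*width + pos[2], pos[3]) once per row of the candidate rectangle.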
Answer 2:
[ Of course: profile it!]
Using a bit rather than a byte will not be faster in the first instance.
However, consider that with characters you can reinterpret blocks of 4 or 8 bytes as unsigned 32-bit or 64-bit integers (making sure you handle alignment) and compare them against the value of 'oooo' or 'oooooooo'. That allows a very fast comparison.
Having gone down the integer approach, you can see that you could do the same with the bit approach and handle, say, 64 bits in a single comparison. That should surely give a real speed-up.
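As a rough sketch of the byte-block idea (my own illustration; memcpy is used to sidestep the alignment issue mentioned above, and the function name is an assumption):

#include <cstdint>
#include <cstring>

// Returns true if all `len` bytes of `row` are 'o', comparing 8 bytes at a time.
bool row_all_o(const char* row, int len)
{
    const std::uint64_t pattern = 0x6F6F6F6F6F6F6F6FULL;   // 0x6F is 'o', so this is 'oooooooo'
    int i = 0;
    for (; i + 8 <= len; i += 8) {
        std::uint64_t chunk;
        std::memcpy(&chunk, row + i, 8);   // safe unaligned read
        if (chunk != pattern)
            return false;
    }
    for (; i < len; ++i)                   // scalar tail
        if (row[i] != 'o')
            return false;
    return true;
}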
Answer 3:
Bitmaps will increase the speed as well, since they involve touching less memory and thus cause more memory references to come from the cache. Also, in place, you might want to copy the elements of best into local variables so that the compiler knows that your writes to bin will not change best. If your compiler supports some spelling of restrict, you might want to use that as well. You can also replace the inner loop in place with the memset library function, and the inner loop in fits with memchr; those may not be large performance improvements, though.
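One possible reading of that suggestion, sketched against the original functions (this is my interpretation, not code from the answer):

#include <cstring>

void place(char* bin, const int* best, int width)
{
    const int top = best[0], height = best[1];
    const int left = best[2], w = best[3];
    for (int i = top; i < top + height; ++i)
        std::memset(bin + i*width + left, 'x', w);                  // fill one row at a time
}

bool fits(const char* bin, const int* pos, int width)
{
    const int top = pos[0], height = pos[1];
    const int left = pos[2], w = pos[3];
    for (int i = top; i < top + height; ++i)
        if (std::memchr(bin + i*width + left, 'x', w) != nullptr)   // any 'x' in this row?
            return false;
    return true;
}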
Answer 4:
First of all, have you remembered to tell your compiler to optimize?
And turn off slow array index bounds checking and such?
That done, you will get a substantial speed-up by representing your binary values as individual bits, since you can then set or clear, say, 32 or 64 bits at a time.
I would also tend to assume that the dynamic allocations add a fair bit of overhead, but apparently you have measured and found that they don't. If, however, memory management does contribute significantly to the time, then the solution depends a bit on the usage pattern. Quite possibly your code generates stack-like alloc/free behavior, in which case you can optimize the allocations down to almost nothing: just allocate a big chunk of memory at the start and then sub-allocate stack-like from that.
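A minimal sketch of such a stack-like sub-allocator, assuming the blocks really are released in reverse order of allocation (the class name and interface are my own):

#include <cstddef>
#include <vector>

// Hands out chunks from one big pre-allocated buffer; "freeing" is only
// supported in LIFO order, by rewinding the offset to an earlier mark.
class StackArena {
public:
    explicit StackArena(std::size_t capacity) : buffer_(capacity), offset_(0) {}

    char* allocate(std::size_t bytes) {
        if (offset_ + bytes > buffer_.size())
            return nullptr;                    // out of arena space
        char* p = buffer_.data() + offset_;
        offset_ += bytes;
        return p;
    }

    std::size_t mark() const { return offset_; }   // remember current position
    void release(std::size_t m) { offset_ = m; }   // stack-like free

private:
    std::vector<char> buffer_;
    std::size_t offset_;
};

A bin would then be obtained with arena.allocate(area) instead of new (nothrow) char[area], and released by rewinding to the mark taken just before the allocation.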
Considering your current code:
void place(char* bin, int* best, int width)
{
    for (int i=best[0]; i<best[0]+best[1]; i++)
        for (int j=best[2]; j<best[2]+best[3]; j++)
            bin[i*width+j] = 'x';
}
Due to possible aliasing the compiler may not realize that e.g. best[0] will be constant during the loop.
So, tell it:
void place(char* bin, int const* best, int const width)
{
    int const maxY = best[0] + best[1];
    int const maxX = best[2] + best[3];
    for( int y = best[0]; y < maxY; ++y )
    {
        for( int x = best[2]; x < maxX; ++x )
        {
            bin[y*width + x] = 'x';
        }
    }
}
Most probably your compiler will hoist the y*width computation out of the inner loop anyway, but why not tell it to do that as well:
void place(char* bin, int const* best, int const width)
{
    int const maxY = best[0]+best[1];
    int const maxX = best[2]+best[3];
    for( int y = best[0]; y < maxY; ++y )
    {
        int const startOfRow = y*width;
        for( int x = best[2]; x < maxX; ++x )
        {
            bin[startOfRow + x] = 'x';
        }
    }
}
This manual optimization (also applied to the other routine) may or may not help; it depends on how smart your compiler is.
Next, if that doesn't help enough, consider replacing the inner loop with std::fill (or memset), doing a whole row in one fell swoop.
And if that doesn't help or doesn't help enough, switch over to a bit-level representation.
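To make the bit-level idea concrete, here is a minimal sketch of a bit-per-cell bin with helpers for testing and marking one row segment; everything here, including the BitBin name and its interface, is my own assumption rather than code from the answer:

#include <cstddef>
#include <cstdint>
#include <vector>

struct BitBin {
    int width;                           // cells per row
    int wordsPerRow;                     // 64-bit words per row
    std::vector<std::uint64_t> words;    // clear bit = 'o', set bit = 'x'

    BitBin(int w, int h)
        : width(w), wordsPerRow((w + 63) / 64),
          words(static_cast<std::size_t>((w + 63) / 64) * h, 0) {}

    // Build a mask covering bit positions [lo, hi) within one 64-bit word.
    static std::uint64_t mask(int lo, int hi) {
        std::uint64_t m = ~std::uint64_t(0) << lo;
        if (hi < 64) m &= (std::uint64_t(1) << hi) - 1;
        return m;
    }

    // True if the cells [x0, x1) of row y are all 'o' (all bits clear).
    bool rowFree(int y, int x0, int x1) const {
        const std::uint64_t* row = &words[static_cast<std::size_t>(y) * wordsPerRow];
        for (int w = x0 / 64; w <= (x1 - 1) / 64; ++w) {
            int lo = (w == x0 / 64) ? x0 % 64 : 0;
            int hi = (w == (x1 - 1) / 64) ? (x1 - 1) % 64 + 1 : 64;
            if (row[w] & mask(lo, hi)) return false;   // some cell is already 'x'
        }
        return true;
    }

    // Mark the cells [x0, x1) of row y as 'x' (set the bits).
    void rowSet(int y, int x0, int x1) {
        std::uint64_t* row = &words[static_cast<std::size_t>(y) * wordsPerRow];
        for (int w = x0 / 64; w <= (x1 - 1) / 64; ++w) {
            int lo = (w == x0 / 64) ? x0 % 64 : 0;
            int hi = (w == (x1 - 1) / 64) ? (x1 - 1) % 64 + 1 : 64;
            row[w] |= mask(lo, hi);
        }
    }
};

fits would then call rowFree on each row of the candidate rectangle and place would call rowSet, touching whole 64-bit words instead of one byte per cell.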
It is perhaps worth noting, and trying out, that every PC has built-in hardware support for optimizing bit-level operations, namely the graphics accelerator card (in the old days called a blitter chip). So you could just use an image library and a black/white bitmap. But since your rectangles are small, I'm not sure whether the setup overhead would outweigh the speed of the actual operation; that needs to be measured. ;-)
Cheers & hth.,
Answer 5:
The biggest improvement I'd expect is from a non-trivial change:
// changed pos to class rect for cleaner syntax
bool fits(char* bin, rect pos, int width)
{
    if (bin[pos.top()*width+pos.left()] == 'x')
        return false;
    if (bin[pos.bottom()*width+pos.right()] == 'x')
        return false;
    if (bin[pos.bottom()*width+pos.left()] == 'x')
        return false;
    if (bin[pos.top()*width+pos.right()] == 'x')
        return false;
    for (int i=pos.top(); i<=pos.bottom(); i++)
        for (int j=pos.left(); j<=pos.right(); j++)
            if (bin[i*width+j] == 'x')
                return false;
    return true;
}
Sure, you're testing bin[pos.bottom()*width+pos.right()] twice. But the first time you do so is much earlier in the algorithm. You add boxes, which means that there is a strong correlation between adjacent bins. Therefore, by checking the corners first, you often return much earlier. You could even consider adding a 5th check in the middle.
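That 5th check might look something like this (a hypothetical addition of mine, placed before the full scan):

    // Probe a cell near the middle of the rectangle before scanning everything.
    int midRow = (pos.top() + pos.bottom()) / 2;
    int midCol = (pos.left() + pos.right()) / 2;
    if (bin[midRow*width + midCol] == 'x')
        return false;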
Answer 6:
Beyond the obligatory statement about using a profiler: the advice given above about switching to a bitmap is a very good idea. If that does not appeal to you:
Consider replacing
for (int i=0; i<area; i++) {
    bin[i]='o';
}
by
memset(bin, 'o', area);
Typically a memset will be faster, as the library implementation fills memory in larger chunks than one byte per loop iteration.
Also
void place(char* bin, int* best, int width)
{
    for (int i=best[0]; i<best[0]+best[1]; i++)
        for (int j=best[2]; j<best[2]+best[3]; j++)
            bin[i*width+j] = 'x';
}
has a bit of room for improvement:
void place(char* bin, int* best, int width)
{
    for (int i=best[0]; i<best[0]+best[1]; i++)
        memset(bin + (i * width) + best[2], 'x', best[3]);
}
by eliminating one of the loops.
A last idea is to change your data representation. Consider using the '\0' character as a replacement for your 'o', and '\1' as a replacement for your 'x'. This is sort of like using a bitmap.
That would enable you to test a cell like this:
if (bin[i*width + j])
{
    // Is an 'x'
}
else
{
    // Is an 'o'
}
Which might produce faster code. Again the profiler is your friend :)
This representation would also let you simply sum a range of cells to determine how many 'x's and 'o's there are:
int sum = 0;
for (int i = 0; i < area; i++)
{
    sum += bin[i];
}
cout << "There are " << sum << " 'x's in the bin" << endl;
Best of luck to you
Evil.
Answer 7:
If you have only 2 values for your basic type, I would first try bool. Then the compiler knows you have 2 values and might be able to optimize some things better. Apart from that, add const where possible (for example, the parameter of fits(bool const*, ...)).
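For illustration, fits might then look like this (a sketch of my own; the rest of the code would change types accordingly):

bool fits(bool const* bin, int const* pos, int width)
{
    for (int i = pos[0]; i < pos[0] + pos[1]; i++)
        for (int j = pos[2]; j < pos[2] + pos[3]; j++)
            if (bin[i*width + j])     // true means the cell is occupied
                return false;
    return true;
}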
Answer 8:
I'd think about memory cache misses. These functions run through sub-matrices inside a bigger matrix, which I suppose is often much bigger in both width and height. That means each row of the small matrix is contiguous memory, but moving between rows can jump across cache lines or pages. Consider laying out the big matrix's cells in memory in an order that keeps the elements of a sub-matrix as close to each other as possible, instead of keeping a vector of contiguous full rows. The first option that comes to my mind is to break your big matrix recursively into matrices of size [2^i, 2^i], ordered { top-left, top-right, bottom-left, bottom-right }.
1) That is, if your matrix has size [X,Y] and is represented in an array of size X*Y, then instead of putting element [x,y] at index (y*X + x), put it at position(x,y):
unsigned position( unsigned rx, unsigned ry )
{
    unsigned x = rx;
    unsigned y = ry;
    unsigned part = 1;
    unsigned pos = 0;
    while( ( x != 0 ) || ( y != 0 ) ) {
        unsigned const lowest_bit_x = ( x % 2 );
        unsigned const lowest_bit_y = ( y % 2 );
        pos += ( ((2*lowest_bit_y) + lowest_bit_x) * part );
        x /= 2; //throw away lowest bit
        y /= 2;
        part *= 4; //each level covers a square 4 times larger
    }
    return pos;
}
I haven't checked this code; it's just to explain what I mean. If you need to, also try to find a faster way to implement it.
Note that the array you allocate will be bigger than X*Y: it has to be the smallest power of four, 2^(2*k), that fits the matrix, which is wasteful unless X and Y are of roughly the same scale. That can be addressed by first breaking the big matrix into squares.
The cache benefits may then outweigh the cost of the more complex position(x,y).
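A rough illustration of how this layout might be used (the dimensions and names here are made up; position() is the function sketched above):

#include <algorithm>
#include <vector>

void demo()
{
    unsigned X = 100, Y = 60;                  // hypothetical bin dimensions
    unsigned side = 1;
    while (side < std::max(X, Y)) side *= 2;   // round up to a power of two (128 here)

    std::vector<char> bin(side * side, 'o');   // bigger than X*Y, as noted above
    bin[ position(5, 7) ] = 'x';               // instead of bin[7*X + 5]
}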
2) Then try to find the best way to run through the elements of a sub-matrix in fits() and place(). I'm not sure yet what it is; it's not necessarily how you do it now. Basically, a sub-matrix of size [x,y] should break into no more than y*log(x)*log(y) blocks that are contiguous in the array representation, and they all fit inside no more than 4 blocks of size 4*x*y. So finally, for matrices smaller than a memory cache page, you'll get no more than 4 cache breaks, while your original code could break y times.
Source: https://stackoverflow.com/questions/4965817/c-performance-checking-a-block-of-memory-for-having-specific-values-in-specif