Bit Aligning for Space and Performance Boosts

闹比i 2021-02-14 23:03

In the book Game Coding Complete, 3rd Edition, the author mentions a technique to both reduce data structure size and increase access performance. In essence it relies

5 Answers
  • 2021-02-14 23:13

    Visual Studio is a great compiler when it comes to optimization. However, bear in mind that the current "Optimization War" in game development is not being fought in the PC arena. While such optimizations may well be dead on the PC, on the console platforms it's a completely different pair of shoes.

    That said, you might want to repost this question on the specialized gamedev Stack Exchange site; you might get some answers directly from "the field".

    Finally, your results are exactly the same up to the microsecond, which is dead impossible on a modern multithreaded system. I'm pretty sure you are either using a very low-resolution timer, or your timing code is broken.

  • 2021-02-14 23:13

    It is highly dependent on the hardware.

    Let me demonstrate:

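    #include <cstring>   //  memset
    #include <ctime>     //  clock, CLOCKS_PER_SEC
    #include <iostream>  //  cout, endl
    
    using namespace std;
    
    #pragma pack( push, 1 )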
    
    struct SlowStruct
    {
        char c;
        __int64 a;
        int b;
        char d;
    };
    
    struct FastStruct
    {
        __int64 a;
        int b;
        char c;
        char d;  
        char unused[ 2 ]; // fill to 8-byte boundary for array use
    };
    
    #pragma pack( pop )
    
    int main (void){
    
        int x = 1000;
        int iterations = 10000000;
    
        SlowStruct *slow = new SlowStruct[x];
        FastStruct *fast = new FastStruct[x];
    
    
    
        //  Warm the cache.
        memset(slow,0,x * sizeof(SlowStruct));
        clock_t time0 = clock();
        for (int c = 0; c < iterations; c++){
            for (int i = 0; i < x; i++){
                slow[i].a += c;
            }
        }
        clock_t time1 = clock();
        cout << "slow = " << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;
        
        //  Warm the cache.
        memset(fast,0,x * sizeof(FastStruct));
        time1 = clock();
        for (int c = 0; c < iterations; c++){
            for (int i = 0; i < x; i++){
                fast[i].a += c;
            }
        }
        clock_t time2 = clock();
        cout << "fast = " << (double)(time2 - time1) / CLOCKS_PER_SEC << endl;
    
    
    
        //  Print to avoid Dead Code Elimination
        __int64 sum = 0;
        for (int c = 0; c < x; c++){
            sum += slow[c].a;
            sum += fast[c].a;
        }
        cout << "sum = " << sum << endl;
    
    
        //  Release the arrays allocated with new[].
        delete[] slow;
        delete[] fast;
    
        return 0;
    }
    

    Core i7 920 @ 3.5 GHz

    slow = 4.578
    fast = 4.434
    sum = 99999990000000000
    

    Okay, not much difference. But it's still consistent over multiple runs.
    So the alignment makes a small difference on Nehalem Core i7.


    Intel Xeon X5482 Harpertown @ 3.2 GHz (Core 2 - generation Xeon)

    slow = 22.803
    fast = 3.669
    sum = 99999990000000000
    

    Now take a look...

    6.2x faster!!!


    Conclusion:

    You see the results. You decide whether or not it's worth your time to do these optimizations.


    EDIT :

    Same benchmarks but without the #pragma pack:

    Core i7 920 @ 3.5 GHz

    slow = 4.49
    fast = 4.442
    sum = 99999990000000000
    

    Intel Xeon X5482 Harpertown @ 3.2 GHz

    slow = 3.684
    fast = 3.717
    sum = 99999990000000000
    
    • The Core i7 numbers didn't change. Apparently it can handle misalignment without trouble for this benchmark.
    • The Core 2 Xeon now shows the same times for both versions. This confirms that misalignment is a problem on the Core 2 architecture.

    Taken from my comment:

    If you leave out the #pragma pack, the compiler will keep everything aligned so you don't see this issue. So this is actually an example of what could happen if you misuse #pragma pack.

  • 2021-02-14 23:17

    Such hand-optimizations are generally long dead. Alignment is only a serious consideration if you're packing for space, or if you have an enforced-alignment type like SSE types. The compiler's default alignment and packing rules are intentionally designed to maximize performance, obviously, and whilst hand-tuning them can be beneficial, it's not generally worth it.

    Probably, in your test program, the compiler never stored any structure on the stack and just kept the members in registers, which do not have alignment, which means that it's fairly irrelevant what the structure size or alignment is.

    Here's the thing: There can be aliasing and other nasties with sub-word accessing, and it's no slower to access a whole word than to access a sub-word. So in general, it's no more efficient, in time, to pack more tightly than word size if you're only accessing, say, one member.

  • 2021-02-14 23:29

    Modern compilers align members on different byte boundaries depending on the size of the member. See the bottom of this.

    Normally you really shouldn't care about structure padding, but if you have an object that is going to have a million instances or so, the rule of thumb is simply to order your members from biggest to smallest. I wouldn't recommend messing with the padding using #pragma directives.

  • 2021-02-14 23:30

    The compiler is going to optimize for either size or speed, and unless you explicitly tell it which, you won't know what you get. But if you follow the advice of that book, you will win on both counts on most compilers. Put the biggest, aligned things first in your struct, then the half-size stuff, then the single-byte stuff if any, and add some dummy variables to align. Using bytes for things that don't have to be bytes can be a performance hit anyway; as a compromise, use ints for everything (you have to know the pros and cons of doing that).

    The x86 has made for a lot of bad programmers and compilers because it allows unaligned accesses, which makes it hard for many folks to move to other platforms (that are taking over). Although unaligned accesses work on an x86, you take a serious performance hit. That is why it is important to know how compilers work, both in general and for the particular one you are using.

    Caches matter too: modern computer platforms rely on caches to get any kind of performance, so you want your data to be both aligned and packed. The simple rule being taught gives you both... in general. It is very good advice. Adding compiler-specific pragmas is not nearly as good: it makes the code non-portable, and it doesn't take much searching through SO or Googling to find out how often the compiler ignores the pragma or doesn't do what you really wanted.
