Bit Aligning for Space and Performance Boosts

闹比i 2021-02-14 23:03

In the book Game Coding Complete, 3rd Edition, the author mentions a technique to both reduce data structure size and increase access performance. In essence it relies

5 answers
  •  抹茶落季
    2021-02-14 23:13

    It is highly dependent on the hardware.

    Let me demonstrate:

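    #include <iostream>  // cout, endl
    #include <cstring>   // memset
    #include <ctime>     // clock, clock_t, CLOCKS_PER_SEC
    using namespace std;
    
    // Note: __int64 is an MSVC-specific type; other compilers need long long or std::int64_t.
    // Pack every struct below to 1-byte alignment (no padding inserted by the compiler).
    #pragma pack( push, 1 )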
    
    struct SlowStruct
    {
        char c;
        __int64 a;
        int b;
        char d;
    };
    
    struct FastStruct
    {
        __int64 a;
        int b;
        char c;
        char d;  
        char unused[ 2 ]; // fill to 8-byte boundary for array use
    };
    
    #pragma pack( pop )
    
    int main (void){
    
        int x = 1000;
        int iterations = 10000000;
    
        SlowStruct *slow = new SlowStruct[x];
        FastStruct *fast = new FastStruct[x];
    
    
    
        //  Warm the cache.
        memset(slow,0,x * sizeof(SlowStruct));
        clock_t time0 = clock();
        for (int c = 0; c < iterations; c++){
            for (int i = 0; i < x; i++){
                slow[i].a += c;
            }
        }
        clock_t time1 = clock();
        cout << "slow = " << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;
        
        //  Warm the cache.
        memset(fast,0,x * sizeof(FastStruct));
        time1 = clock();
        for (int c = 0; c < iterations; c++){
            for (int i = 0; i < x; i++){
                fast[i].a += c;
            }
        }
        clock_t time2 = clock();
        cout << "fast = " << (double)(time2 - time1) / CLOCKS_PER_SEC << endl;
    
    
    
        //  Print to avoid Dead Code Elimination
        __int64 sum = 0;
        for (int c = 0; c < x; c++){
            sum += slow[c].a;
            sum += fast[c].a;
        }
        cout << "sum = " << sum << endl;
    
    
        //  Release the arrays allocated with new[].
        delete[] slow;
        delete[] fast;
        return 0;
    }
    

    Core i7 920 @ 3.5 GHz

    slow = 4.578
    fast = 4.434
    sum = 99999990000000000
    

    Okay, not much difference (about 3%), but it's consistent over multiple runs.
    So the alignment makes a small difference on the Nehalem Core i7.


    Intel Xeon X5482 Harpertown @ 3.2 GHz (Core 2 - generation Xeon)

    slow = 22.803
    fast = 3.669
    sum = 99999990000000000
    

    Now take a look...

    6.2x faster!!!


    Conclusion:

    You see the results. You decide whether or not it's worth your time to do these optimizations.


    EDIT :

    Same benchmarks but without the #pragma pack:

    Core i7 920 @ 3.5 GHz

    slow = 4.49
    fast = 4.442
    sum = 99999990000000000
    

    Intel Xeon X5482 Harpertown @ 3.2 GHz

    slow = 3.684
    fast = 3.717
    sum = 99999990000000000
    
    • The Core i7 numbers barely changed. Apparently it can handle misalignment without trouble for this benchmark.
    • The Core 2 Xeon now shows the same times for both versions. This confirms that misalignment is a problem on the Core 2 architecture.
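
    One plausible contributor to the Core 2 slowdown (an assumption; the benchmark above does not isolate it) is that some of the packed `a` fields cross a cache-line boundary. The sketch below counts how often that happens for the two packed layouts. The 64-byte line size, the helper name `count_line_splits`, and the premise that the array starts on a cache-line boundary are all assumptions made for illustration. It needs C++14 or later for the `constexpr` loop.

```cpp
#include <cstddef>  // std::size_t

// Assumed cache-line size in bytes (typical for x86, but an assumption here).
constexpr std::size_t kCacheLine = 64;

// Hypothetical helper: counts how many of `count` array elements have an
// 8-byte field that crosses a cache-line boundary, given the element stride
// and the field's offset inside the element. Assumes the array itself starts
// on a cache-line boundary.
constexpr std::size_t count_line_splits(std::size_t stride,
                                        std::size_t field_offset,
                                        std::size_t count) {
    std::size_t splits = 0;
    for (std::size_t i = 0; i < count; ++i) {
        // Position of the field's first byte within its cache line.
        std::size_t start = (i * stride + field_offset) % kCacheLine;
        if (start + 8 > kCacheLine) {
            ++splits;  // the 8 bytes spill into the next line
        }
    }
    return splits;
}

// Packed SlowStruct: 1 + 8 + 4 + 1 = 14-byte stride, `a` at offset 1.
constexpr std::size_t kSlowSplits = count_line_splits(14, 1, 1000);

// Packed FastStruct: 8 + 4 + 1 + 1 + 2 = 16-byte stride, `a` at offset 0.
constexpr std::size_t kFastSplits = count_line_splits(16, 0, 1000);

static_assert(kSlowSplits > 0, "some packed SlowStruct fields cross a line");
static_assert(kFastSplits == 0, "FastStruct fields never cross a line");
```

    Under these assumptions every `a` in the packed SlowStruct array is misaligned, and a fraction of them also span two cache lines, while the FastStruct array never does.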

    Taken from my comment:

    If you leave out the #pragma pack, the compiler will keep everything aligned so you don't see this issue. So this is actually an example of what could happen if you misuse #pragma pack.
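
    To make that concrete, here is a minimal sketch comparing the layout of SlowStruct's member order with and without packing. The struct names `NaturalSlow` and `PackedSlow` and the helper `print_layouts` are made up for this illustration, `std::int64_t` stands in for `__int64`, and a 4-byte `int` is assumed.

```cpp
#include <cstddef>  // offsetof, std::size_t
#include <cstdint>  // std::int64_t
#include <cstdio>   // std::printf

// Same member order as SlowStruct, default (natural) alignment.
struct NaturalSlow {
    char c;
    std::int64_t a;  // compiler inserts padding after c so a is aligned
    int b;
    char d;
};

#pragma pack(push, 1)
// Same member order, packed to 1-byte alignment as in the benchmark.
struct PackedSlow {
    char c;
    std::int64_t a;  // no padding: a starts right after c
    int b;
    char d;
};
#pragma pack(pop)

// Packed: a lands at offset 1 and the struct is 1 + 8 + 4 + 1 = 14 bytes,
// so a is misaligned in every array element.
static_assert(offsetof(PackedSlow, a) == 1, "a follows c directly");
static_assert(sizeof(PackedSlow) == 14, "no padding anywhere");

// Natural: a sits at a multiple of its alignment, and the struct size is
// padded so that this stays true for every element of an array.
static_assert(offsetof(NaturalSlow, a) % alignof(std::int64_t) == 0,
              "a is aligned within the struct");
static_assert(sizeof(NaturalSlow) % alignof(std::int64_t) == 0,
              "array elements keep a aligned");

// Prints the sizes and offsets so they can be inspected on a given platform.
void print_layouts() {
    std::printf("natural: sizeof=%zu offsetof(a)=%zu\n",
                sizeof(NaturalSlow), offsetof(NaturalSlow, a));
    std::printf("packed:  sizeof=%zu offsetof(a)=%zu\n",
                sizeof(PackedSlow), offsetof(PackedSlow, a));
}
```

    Calling `print_layouts()` from `main` shows what the compiler actually chose on your platform. The natural layout is larger because of padding, which is why reordering members (as FastStruct does) recovers the space without giving up alignment.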
