AVX 256-bit code performing slightly worse than equivalent 128-bit SSSE3 code


I am trying to write very efficient Hamming-distance code. Inspired by Wojciech Muła's extremely clever SSSE3 popcount implementation, I coded an AVX2 equivalent.

2 Answers
  • 2021-02-04 03:03

    You should consider using the usual _mm_popcnt_u64 instruction instead of hacking it in SSE or AVX. I tested all methods of popcounting thoroughly, including an SSE and an AVX version (which ultimately led to my more or less famous question about popcount). _mm_popcnt_u64 outperforms SSE and AVX considerably, especially when you use a compiler which works around the Intel popcount bug discovered in my question. Without the bug, my Haswell can popcount at 26 GB/s, which almost hits the bus bandwidth.
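
    For reference, here is a minimal sketch of what such a throughput-oriented loop can look like when popcounting a large buffer (the function name and unroll factor are illustrative, not the benchmark from that question); using several independent accumulators gives the core more instruction-level parallelism:

    #include <nmmintrin.h>   // _mm_popcnt_u64 (POPCNT/SSE4.2 support required)
    #include <cstdint>
    #include <cstddef>

    // Hypothetical helper: popcount an n-word buffer with the scalar popcnt
    // instruction, unrolled by four with independent accumulators.
    uint64_t popcount_buffer(const uint64_t* data, size_t n) {
        uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            c0 += _mm_popcnt_u64(data[i + 0]);
            c1 += _mm_popcnt_u64(data[i + 1]);
            c2 += _mm_popcnt_u64(data[i + 2]);
            c3 += _mm_popcnt_u64(data[i + 3]);
        }
        for (; i < n; ++i)    // remainder
            c0 += _mm_popcnt_u64(data[i]);
        return c0 + c1 + c2 + c3;
    }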

    The reason _mm_popcnt_u64 is faster is simply that it popcounts 64 bits at once (already a quarter of what the 256-bit AVX version handles) while requiring only one cheap processor instruction. It costs only a few cycles (latency 3, throughput 1 on Intel). Even if every AVX instruction you use took only one cycle, you would still get worse results due to the sheer number of instructions necessary for popcounting 256 bits.

    Try this, it should be fastest:

    #include <nmmintrin.h>   // _mm_popcnt_u64
    #include <cstdint>

    int popcount256(const uint64_t* u) {
        return _mm_popcnt_u64(u[0])
             + _mm_popcnt_u64(u[1])
             + _mm_popcnt_u64(u[2])
             + _mm_popcnt_u64(u[3]);
    }
    

    I know this does not answer your core question of why AVX is slower, but since your ultimate goal is a fast popcount, the AVX vs. SSE comparison is irrelevant: both are inferior to the built-in popcount instruction.
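
    Since the actual goal is Hamming distance, the same idea applies directly: XOR the two operands and popcount the result. A minimal sketch, assuming each 256-bit value is stored as four uint64_t (the helper name is mine, not the OP's AVX_PopCount):

    #include <nmmintrin.h>   // _mm_popcnt_u64
    #include <cstdint>

    // Hamming distance of two 256-bit values, each stored as 4 x uint64_t.
    int hamming256(const uint64_t* a, const uint64_t* b) {
        return _mm_popcnt_u64(a[0] ^ b[0])
             + _mm_popcnt_u64(a[1] ^ b[1])
             + _mm_popcnt_u64(a[2] ^ b[2])
             + _mm_popcnt_u64(a[3] ^ b[3]);
    }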

  • 2021-02-04 03:22

    In addition to the minor issues in the comments (compiling with /arch:AVX), your primary problem is that you generate the random input arrays on every iteration. That is your bottleneck, so your test is not effectively evaluating your methods. Note: I'm not using boost, but GetTickCount works for this purpose. Consider just:

    int count;
    count = 0;
    {
        cout << "AVX PopCount\r\n";
        unsigned int Tick = GetTickCount();
        for (int i = 0; i < 1000000; i++) {
            for (int j = 0; j < 16; j++) {
                a[j] = dice();
                b[j] = dice();
            }
            count += AVX_PopCount(a, b);
        }
        Tick = GetTickCount() - Tick;
        cout << Tick << "\r\n";
    }
    cout << count << "\r\n";
    

    produces output:

    AVX PopCount
    2309
    256002470

    So 2309 ms to complete... but what happens if we get rid of your AVX routine altogether? Just make the input arrays:

    int count;
    count = 0;
    {
        cout << "Just making arrays...\r\n";
        unsigned int Tick = GetTickCount();
        for (int i = 0; i < 1000000; i++) {
            for (int j = 0; j < 16; j++) {
                a[j] = dice();
                b[j] = dice();
            }           
        }
        Tick = GetTickCount() - Tick;
        cout << Tick << "\r\n";
    }
    

    produces output:

    Just making arrays...
    2246

    How about that: 2246 of the 2309 ms were spent just building the inputs. Not surprising, really, since you're generating 32 random numbers per iteration, which can be quite expensive, while the popcount itself is only some rather fast integer math and shuffles.

    So...

    Now let's multiply the iteration count by 100 and move the random generator out of the tight loop. Compiling with optimizations disabled runs your code as written and will not throw away the "useless" repeated iterations; presumably the code we care about here is already (manually) optimized! (A sketch of how to keep the work alive once you turn optimizations back on follows the results below.)

        for (int j = 0; j < 16; j++) {
            a[j] = dice();
            b[j] = dice();
        }
    
        int count;
        count = 0;
        {
            cout << "AVX PopCount\r\n";
            unsigned int Tick = GetTickCount();
            for (int i = 0; i < 100000000; i++) {           
                count += AVX_PopCount(a, b);
            }
            Tick = GetTickCount() - Tick;
            cout << Tick << "\r\n";
        }
    
        cout << count << "\r\n";
    
        count = 0;
        {
            cout << "SSE PopCount\r\n";
            unsigned int Tick = GetTickCount();
            for (int i = 0; i < 100000000; i++) {
                count += SSE_PopCount(a, b);
            }
            Tick = GetTickCount() - Tick;
            cout << Tick << "\r\n";
        }
        cout << count << "\r\n";
    

    produces output:

    AVX PopCount
    3744
    730196224
    SSE PopCount
    5616
    730196224

    So congratulations - you can pat yourself on the back: your AVX routine is indeed about a third faster than the SSE routine (3744 ms vs. 5616 ms, tested on a Haswell i7 here). The lesson is to be sure that you are actually profiling what you think you are profiling!
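
    One related caveat, not from the original measurements: if you rebuild this benchmark with optimizations enabled, the compiler may hoist the call out of the loop or delete iterations entirely, since the inputs never change and the result is not otherwise used. A minimal sketch of one way to guard against that, using hypothetical volatile sink/seed variables:

    #include <cstdint>

    // Hypothetical guards, not part of the benchmark above.
    volatile uint64_t g_sink = 0;   // opaque write: the result is "observed"
    volatile unsigned g_seed = 1;   // opaque read: inputs are not loop-invariant

    template <class Kernel>
    uint64_t run_timed(Kernel kernel, int iters) {
        uint64_t total = 0;
        for (int i = 0; i < iters; ++i) {
            unsigned s = g_seed;    // compiler cannot assume this is constant
            total += kernel(s);     // so the kernel must really run each time
        }
        g_sink = total;             // result escapes, loop cannot be discarded
        return total;
    }

    With the routines from the question this would be called as something like run_timed([&](unsigned s) { a[0] ^= s; return (uint64_t)AVX_PopCount(a, b); }, 100000000), where perturbing a[0] with the opaque seed keeps the inputs changing from one iteration to the next.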
