Reset C int array to zero : the fastest way?

感情败类 2020-12-22 18:34

Assuming that we have a T myarray[100] with T = int, unsigned int, long long int or unsigned long long int, what is the fastest way to reset all its content to zero?

7 Answers
  • 2020-12-22 19:13

    This question, although rather old, needs some benchmarks, since it asks not for the most idiomatic way, or the way that can be written in the fewest lines, but for the fastest way, and it is silly to answer that without some actual testing. So I compared four solutions: memset, std::fill, the ZERO macro from AnT's answer, and a solution I made using AVX intrinsics.
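
    For reference, the standard-library contenders are just the two calls below (a minimal sketch for the question's T myarray[100] with T = int; AnT's ZERO macro is defined in his answer and not repeated here):

    #include <cstring>
    #include <algorithm>
    #include <iterator>

    int myarray[100];   // T myarray[100] from the question, with T = int

    void zero_with_memset() { std::memset(myarray, 0, sizeof myarray); }
    void zero_with_fill()   { std::fill(std::begin(myarray), std::end(myarray), 0); }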

    Note that the intrinsics solution below is not generic: it only works on element sizes of 32 or 64 bits. Please comment if this code is doing something incorrect.

    #include <stddef.h>
    #include <immintrin.h>
    /* Zero an array a of n elements (element size must be 4 or 8 bytes). */
    #define intrin_ZERO(a,n){\
    size_t x = 0;\
    const size_t inc = 32 / sizeof(*(a));/*number of elements per 256 bit register*/\
    for (;x + inc <= (n);x+=inc)/*bulk of the array: one 32 byte store per iteration*/\
        _mm256_storeu_ps((float *)((a)+x),_mm256_setzero_ps());\
    if(8 == sizeof(*(a))){/*tail for 8 byte elements: 0..3 left*/\
        switch((n)-x){\
        case 3:\
            (a)[x] = 0;x++;/*fall through*/\
        case 2:\
            _mm_storeu_ps((float *)((a)+x),_mm_setzero_ps());break;/*16 byte store = 2 elements*/\
        case 1:\
            (a)[x] = 0;\
            break;\
        case 0:\
            break;\
        };\
    }\
    else if(4 == sizeof(*(a))){/*tail for 4 byte elements: 0..7 left*/\
    switch((n)-x){\
        case 7:\
            (a)[x] = 0;x++;/*fall through*/\
        case 6:\
            (a)[x] = 0;x++;\
        case 5:\
            (a)[x] = 0;x++;\
        case 4:\
            _mm_storeu_ps((float *)((a)+x),_mm_setzero_ps());break;/*16 byte store = 4 elements*/\
        case 3:\
            (a)[x] = 0;x++;\
        case 2:\
            *(long long *)((a)+x) = 0;break;/*8 byte store = 2 elements*/\
        case 1:\
            (a)[x] = 0;\
            break;\
        case 0:\
            break;\
    };\
    }\
    }
    

    I will not claim that this is the fastest method, since I am not a low-level optimization expert. Rather, it is an example of a correct architecture-dependent implementation that is faster than memset.
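
    For completeness, this is how the macro above would be invoked on the benchmarked 100-element arrays (a usage sketch; it assumes an AVX-capable CPU):

    int a32[100];
    long long a64[100];

    intrin_ZERO(a32, 100);   // 256-bit stores plus the 4-byte-element tail handling
    intrin_ZERO(a64, 100);   // 256-bit stores plus the 8-byte-element tail handling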

    Now, onto the results. I measured performance for 100-element int and long long arrays, both statically and dynamically allocated, but, with the exception of msvc, which performed dead code elimination on the static arrays, the results were extremely comparable, so I will show only the dynamic-array performance. Timings are in ms for 1 million iterations, using time.h's low-precision clock function.
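
    The exact harness is not shown here; below is a minimal sketch of the kind of loop described. The names and the final read of arr[0] (there to discourage the dead code elimination mentioned above) are illustrative choices, not the actual test code:

    #include <cstdio>
    #include <cstring>
    #include <ctime>

    int main() {
        const int  N     = 100;
        const long ITERS = 1000000L;
        int *arr = new int[N];                      // dynamically allocated, as in the test
        std::clock_t start = std::clock();
        for (long i = 0; i < ITERS; ++i)
            std::memset(arr, 0, N * sizeof *arr);   // swap in fill / ZERO / intrin_ZERO here
        long ms = (long)((std::clock() - start) * 1000 / CLOCKS_PER_SEC);
        std::printf("memset: %ld ms (arr[0]=%d)\n", ms, arr[0]);  // read arr so the stores are harder to discard
        delete[] arr;
    }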

    clang 3.8 (using the clang-cl frontend, optimization flags: /OX /arch:AVX /Oi /Ot)

    int:
    memset:      99
    fill:        97
    ZERO:        98
    intrin_ZERO: 90
    
    long long:
    memset:      285
    fill:        286
    ZERO:        285
    intrin_ZERO: 188
    

    gcc 5.1.0 (optimization flags: -O3 -march=native -mtune=native -mavx):

    int:
    memset:      268
    fill:        268
    ZERO:        268
    intrin_ZERO: 91
    
    long long:
    memset:      402
    fill:        399
    ZERO:        400
    intrin_ZERO: 185
    

    msvc 2015 (optimization flags: /OX /arch:AVX /Oi /Ot):

    int:
    memset:      196
    fill:        613
    ZERO:        221
    intrin_ZERO: 95
    
    long long:
    memset:      273
    fill:        559
    ZERO:        376
    intrin_ZERO: 188
    

    There is a lot of interesting stuff going on here: llvm killing gcc, and MSVC's typically spotty optimizations (it does an impressive dead code elimination on static arrays, then has awful performance for fill). Although my implementation is significantly faster, that may only be because it recognizes that clearing bits has much less overhead than setting them to any other value.

    Clang's memset merits a closer look, as it is significantly faster. Some additional testing shows that it is in fact specialized for zero: non-zero memsets on a 400-byte array are much slower (~220 ms) and comparable to gcc's. However, with an 800-byte array there is no speed difference between zero and non-zero memset, which is probably why in that case its memset has worse performance than my implementation: the specialization only kicks in for small arrays, and the cutoff is right around 800 bytes. Also note that gcc's fill and ZERO are not being optimized into memset calls (judging from the generated code); gcc simply generates code with identical performance characteristics.
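
    The additional testing mentioned above amounts to timing calls like the following with the same loop as before; the buffer and function names are only illustrative:

    #include <cstring>

    unsigned char buf[800];

    void probe_memset_specialization() {
        std::memset(buf, 0, 400);   // zero fill: benefits from the small-size zero specialization
        std::memset(buf, 1, 400);   // non-zero fill: the ~220 ms case mentioned above
        std::memset(buf, 0, 800);   // at 800 bytes the zero/non-zero gap disappears
    }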

    Conclusion: memset is not optimized for this task as well as people pretend it is (otherwise gcc's, msvc's, and llvm's memset would all have the same performance). If performance matters, memset should not be the final answer, especially for these awkward medium-sized arrays, because it is not specialized for bit clearing and it is not hand-optimized any better than what the compiler can do on its own.
