Assuming that we have a T myarray[100]
with T = int, unsigned int, long long int or unsigned long long int, what is the fastest way to reset all its content to
This question, although rather old, needs some benchmarks, as it asks for not the most idiomatic way, or the way that can be written in the fewest number of lines, but the fastest way. And it is silly to answer that question without some actual testing. So I compared four solutions, memset vs. std::fill vs. ZERO of AnT's answer vs a solution I made using AVX intrinsics.
Note that this solution is not generic, it only works on data of 32 or 64 bits. Please comment if this code is doing something incorrect.
#include<immintrin.h>
#define intrin_ZERO(a,n){\
size_t x = 0;\
const size_t inc = 32 / sizeof(*(a));/*size of 256 bit register over size of variable*/\
for (;x < n-inc;x+=inc)\
_mm256_storeu_ps((float *)((a)+x),_mm256_setzero_ps());\
if(4 == sizeof(*(a))){\
switch(n-x){\
case 3:\
(a)[x] = 0;x++;\
case 2:\
_mm_storeu_ps((float *)((a)+x),_mm_setzero_ps());break;\
case 1:\
(a)[x] = 0;\
break;\
case 0:\
break;\
};\
}\
else if(8 == sizeof(*(a))){\
switch(n-x){\
case 7:\
(a)[x] = 0;x++;\
case 6:\
(a)[x] = 0;x++;\
case 5:\
(a)[x] = 0;x++;\
case 4:\
_mm_storeu_ps((float *)((a)+x),_mm_setzero_ps());break;\
case 3:\
(a)[x] = 0;x++;\
case 2:\
((long long *)(a))[x] = 0;break;\
case 1:\
(a)[x] = 0;\
break;\
case 0:\
break;\
};\
}\
}
I will not claim that this is the fastest method, since I am not a low level optimization expert. Rather it is an example of a correct architecture dependent implementation that is faster than memset.
Now, onto the results. I calculated performance for size 100 int and long long arrays, both statically and dynamically allocated, but with the exception of msvc, which did a dead code elimination on static arrays, the results were extremely comparable, so I will show only dynamic array performance. Time markings are ms for 1 million iterations, using time.h's low precision clock function.
clang 3.8 (Using the clang-cl frontend, optimization flags= /OX /arch:AVX /Oi /Ot)
int:
memset: 99
fill: 97
ZERO: 98
intrin_ZERO: 90
long long:
memset: 285
fill: 286
ZERO: 285
intrin_ZERO: 188
gcc 5.1.0 (optimization flags: -O3 -march=native -mtune=native -mavx):
int:
memset: 268
fill: 268
ZERO: 268
intrin_ZERO: 91
long long:
memset: 402
fill: 399
ZERO: 400
intrin_ZERO: 185
msvc 2015 (optimization flags: /OX /arch:AVX /Oi /Ot):
int
memset: 196
fill: 613
ZERO: 221
intrin_ZERO: 95
long long:
memset: 273
fill: 559
ZERO: 376
intrin_ZERO: 188
There is a lot interesting going on here: llvm killing gcc, MSVC's typical spotty optimizations (it does an impressive dead code elimination on static arrays and then has awful performance for fill). Although my implementation is significantly faster, this may only be because it recognizes that bit clearing has much less overhead than any other setting operation.
Clang's implementation merits more looking at, as it is significantly faster. Some additional testing shows that its memset is in fact specialized for zero--non zero memsets for 400 byte array are much slower (~220ms) and are comparable to gcc's. However, the nonzero memsetting with an 800 byte array makes no speed difference, which is probably why in that case, their memset has worse performance than my implementation--the specialization is only for small arrays, and the cuttoff is right around 800 bytes. Also note that gcc 'fill' and 'ZERO' are not optimizing to memset (looking at generated code), gcc is simply generating code with identical performance characteristics.
Conclusion: memset is not really optimized for this task as well as people would pretend it is (otherwise gcc and msvc and llvm's memset would have the same performance). If performance matters then memset should not be a final solution, especially for these awkward medium sized arrays, because it is not specialized for bit clearing, and it is not hand optimized any better than the compiler can do on its own.