I'm trying to optimize a computation-intensive algorithm and am kind of stuck at some cache problem. I have a huge buffer which is written occasionally and at random and read o
Shouldn't func4 be this:
    void func4() {
        __m128 buf = _mm_setr_ps(5.0f, 5.0f, 5.0f, 5.0f);
        for(int i = 0; i < length; i += 16) {
            _mm_stream_ps(&arr[i], buf);
            _mm_stream_ps(&arr[i+4], buf);
            _mm_stream_ps(&arr[i+8], buf);
            _mm_stream_ps(&arr[i+12], buf);
        }
    }
Physical memory may not be allocated at the time of malloc, but on first touch, inside your func* functions. The OS may also shuffle memory around after a large amount of memory is allocated, so benchmarks performed just after memory allocation may not be reliable.
Because of possible aliasing, the compiler may reload the arr value from memory instead of keeping it in a register. This may cost some performance. The easiest way to avoid aliasing is to copy arr and length to local variables and use only local variables to fill the array. There is plenty of well-known advice to avoid global variables; aliasing is one of the reasons.
_mm_stream_ps works better if the array is aligned to 64 bytes. In your code no such alignment is guaranteed (actually, malloc aligns it to 16 bytes). This optimization is noticeable only for short arrays.
Use _mm_mfence after you have finished with _mm_stream_ps. This is needed for correctness, not for performance.
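Here is a minimal sketch of how these points could be combined. It assumes a C11 compiler on x86 with SSE2, and that arr is a float* and length an int, as the snippet above suggests. alloc_buffer is a hypothetical helper, not something from your code, and the memset is only there to touch the pages before you start timing.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <xmmintrin.h>  /* _mm_set1_ps, _mm_stream_ps */
#include <emmintrin.h>  /* _mm_mfence (SSE2) */

/* Globals standing in for the ones in the question (types are assumed). */
float *arr;
int length;

/* Hypothetical helper: allocate a 64-byte-aligned buffer and touch every
   page, so page faults happen here and not inside the timed function. */
int alloc_buffer(int n) {
    assert(n > 0 && n % 16 == 0);              /* whole 64-byte lines only */
    arr = aligned_alloc(64, (size_t)n * sizeof *arr);
    if (arr == NULL) return -1;
    memset(arr, 0, (size_t)n * sizeof *arr);   /* first touch */
    length = n;
    return 0;
}

void func4(void) {
    /* Local copies: the stores below cannot force a reload of these. */
    float *p = arr;
    const int n = length;
    const __m128 buf = _mm_set1_ps(5.0f);

    assert(((uintptr_t)p & 15) == 0);          /* _mm_stream_ps needs 16-byte alignment */
    for (int i = 0; i < n; i += 16) {          /* 16 floats = 64 bytes per iteration */
        _mm_stream_ps(p + i, buf);
        _mm_stream_ps(p + i + 4, buf);
        _mm_stream_ps(p + i + 8, buf);
        _mm_stream_ps(p + i + 12, buf);
    }
    _mm_mfence();                              /* order the streaming stores before later accesses */
}
```

You would call alloc_buffer once, outside the measured region, then time func4 on its own. Note that aligned_alloc requires the size to be a multiple of the alignment, which is why the sketch insists on a length that is a multiple of 16 floats.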