Obtaining peak bandwidth on Haswell in the L1 cache: only getting 62%
I'm attempting to obtain full bandwidth in the L1 cache for the following function on Intel processors float triad(float *x, float *y, float *z, const int n) { float k = 3.14159f; for(int i=0; i<n; i++) { z[i] = x[i] + k*y[i]; } } This is the triad function from STREAM . I get about 95% of the peak with SandyBridge/IvyBridge processors with this function (using assembly with NASM). However, using Haswell I only achieve 62% of the peak unless I unroll the loop. If I unroll 16 times I get 92%. I don't understand this. I decided to write my function in assembly using NASM. The main loop in