fma

Obtaining peak bandwidth on Haswell in the L1 cache: only getting 62%

ε祈祈猫儿з submitted on 2019-11-26 15:06:08
I'm attempting to obtain full bandwidth in the L1 cache for the following function on Intel processors:

    float triad(float *x, float *y, float *z, const int n) {
        float k = 3.14159f;
        for(int i=0; i<n; i++) {
            z[i] = x[i] + k*y[i];
        }
    }

This is the triad function from STREAM. I get about 95% of the peak with SandyBridge/IvyBridge processors with this function (using assembly with NASM). However, using Haswell I only achieve 62% of the peak unless I unroll the loop. If I unroll 16 times I get 92%. I don't understand this. I decided to write my function in assembly using NASM. The main loop in
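As a rough illustration of the loop unrolling the question refers to, here is a minimal sketch in plain C. It is not the asker's NASM code. The function name `triad_unrolled4`, the unroll factor of 4 (the question reports results for 16), and passing `k` as a parameter are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: STREAM triad z[i] = x[i] + k*y[i], unrolled by 4.
   The unroll factor is illustrative only. */
static void triad_unrolled4(const float *x, const float *y, float *z,
                            size_t n, float k)
{
    assert(x && y && z);
    size_t i = 0;

    /* Main body: four independent elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        z[i]     = x[i]     + k * y[i];
        z[i + 1] = x[i + 1] + k * y[i + 1];
        z[i + 2] = x[i + 2] + k * y[i + 2];
        z[i + 3] = x[i + 3] + k * y[i + 3];
    }

    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++) {
        z[i] = x[i] + k * y[i];
    }
}
```

Unrolling reduces loop overhead per element. Whether it closes the gap described in the question depends on the generated instructions, which this C sketch does not control.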

Optimize for fast multiplication but slow addition: FMA and doubledouble

大兔子大兔子 submitted on 2019-11-26 11:34:42
Question: When I first got a Haswell processor I tried implementing FMA to determine the Mandelbrot set. The main algorithm is this:

    intn n = 0;
    for(int32_t i=0; i<maxiter; i++) {
        floatn x2 = square(x), y2 = square(y); //square(x) = x*x
        floatn r2 = x2 + y2;
        booln mask = r2<cut; //booln is in the float domain, not the integer domain
        if(!horizontal_or(mask)) break; //_mm256_testz_pd(mask)
        n -= mask;
        floatn t = x*y; mul2(t); //mul2(t): t*=2
        x = x2 - y2 + cx;
        y = t + cy;
    }

This determines if n pixels are in the

How to use Fused Multiply-Add (FMA) instructions with SSE/AVX

我的梦境 submitted on 2019-11-26 07:28:19
Question: I have learned that some Intel/AMD CPUs can do simultaneous multiply and add with SSE/AVX: FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2. I would like to know how best to do this in code, and I also want to know how it's done internally in the CPU, that is, with the super-scalar architecture. Let's say I want to do a long sum such as the following in SSE:

    //sum = a1*b1 + a2*b2 + a3*b3 +... where a is a scalar and b is a SIMD vector (e.g. from matrix multiplication)
    sum = _mm_set1_ps(0.0f
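The long sum in the question has the shape of a chain of multiply-adds. A minimal scalar sketch in portable C follows. It assumes a hypothetical helper name `mul_add_sum` and two accumulators chosen for illustration. Each update is written as `a*b + acc`, the form that the FMA intrinsic `_mm_fmadd_ps(a, b, acc)` computes for four floats at once. Whether a compiler turns the scalar statements into FMA instructions depends on the target and build flags, which the excerpt does not specify.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns a[0]*b[0] + a[1]*b[1] + ... + a[n-1]*b[n-1].
   Two independent accumulators are used so that consecutive multiply-adds
   do not all wait on the same previous result. */
static float mul_add_sum(const float *a, const float *b, size_t n)
{
    assert(a && b);
    float acc0 = 0.0f;
    float acc1 = 0.0f;
    size_t i = 0;

    for (; i + 2 <= n; i += 2) {
        acc0 = a[i]     * b[i]     + acc0;  /* multiply-add on chain 0 */
        acc1 = a[i + 1] * b[i + 1] + acc1;  /* multiply-add on chain 1 */
    }

    /* Odd element left over when n is not a multiple of 2. */
    if (i < n) {
        acc0 = a[i] * b[i] + acc0;
    }

    return acc0 + acc1;
}
```

Splitting the sum across accumulators changes the order of floating-point additions, so results can differ in the last bits from a strictly sequential sum.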