Let's assume that we have a function that multiplies two arrays of 1000000 doubles each. In C/C++ the function looks like this:

void mul_c(double* a, double* b)
{
    for (int i = 0; i != 1000000; ++i)
    {
        a[i] = a[i] * b[i];
    }
}
There was a major bug in the timing function I used for the previous benchmarks. This grossly underestimated the bandwidth without vectorization, as well as other measurements. Additionally, there was another problem that overestimated the bandwidth due to COW on the array that was read but not written to. Finally, the maximum bandwidth I used was incorrect. I have updated my answer with the corrections, and I have left the old answer at the end.
Your operation is memory bandwidth bound. This means the CPU is spending most of its time waiting on slow memory reads and writes. An excellent explanation for this can be found here: Why vectorizing the loop does not have performance improvement.
However, I have to disagree slightly with one statement in that answer.
So regardless of how it's optimized, (vectorized, unrolled, etc...) it isn't gonna get much faster.
In fact, vectorization, unrolling, and multiple threads can significantly increase the bandwidth even in memory bandwidth bound operations. The reason is that it is difficult to obtain the maximum memory bandwidth. A good explanation for this can be found here: https://stackoverflow.com/a/25187492/2542702.
The rest of my answer will show how vectorization and multiple threads can get closer to the maximum memory bandwidth.
My test system: Ubuntu 16.10, Skylake (i7-6700HQ@2.60GHz), 32 GB RAM, dual-channel DDR4@2400 MT/s. The maximum bandwidth of my system is 38.4 GB/s (2400 MT/s * 2 channels * 8 bytes).
From the code below I produce the following tables. I set the number of threads using OMP_NUM_THREADS, e.g. export OMP_NUM_THREADS=4. The efficiency is bandwidth/max_bandwidth.
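For example, one row of the tables below would be produced roughly like this (the file name mul.c is just a placeholder for the benchmark source listed further down):

gcc -O2 -march=native -fopenmp mul.c
export OMP_NUM_THREADS=2
./a.out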
-O2 -march=native -fopenmp

Threads   Efficiency
1         59.2%
2         76.6%
4         74.3%
8         70.7%

-O2 -march=native -fopenmp -funroll-loops

Threads   Efficiency
1         55.8%
2         76.5%
4         72.1%
8         72.2%

-O3 -march=native -fopenmp

Threads   Efficiency
1         63.9%
2         74.6%
4         63.9%
8         63.2%

-O3 -march=native -fopenmp -mprefer-avx128

Threads   Efficiency
1         67.8%
2         76.0%
4         63.9%
8         63.2%

-O3 -march=native -fopenmp -mprefer-avx128 -funroll-loops

Threads   Efficiency
1         68.8%
2         73.9%
4         69.0%
8         66.8%
Because of uncertainty in the measurements, I ran the benchmarks several times and reached the following conclusion:
The solution that gives the best bandwidth is scalar operations with two threads.
The code I used to benchmark:
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <omp.h>
#define N 10000000
#define R 100
void mul(double *a, double *b) {
    #pragma omp parallel for
    for (int i = 0; i<N; i++) a[i] *= b[i];
}

int main() {
    double maxbw = 2.4*2*8; // 2.4 GT/s * 2 channels * 64 bits * 1 byte/8 bits = 38.4 GB/s
    double mem = 3*sizeof(double)*N*R*1E-9; // GB: read a, read b, write a

    double *a = (double*)malloc(sizeof *a * N);
    double *b = (double*)malloc(sizeof *b * N);
    //due to copy-on-write b must be initialized to get the correct bandwidth
    //also, GCC will convert malloc + memset(0) to calloc so use memset(1)
    memset(b, 1, sizeof *b * N);

    double dtime = -omp_get_wtime();
    for(int i=0; i<R; i++) mul(a,b);
    dtime += omp_get_wtime();
    printf("%.2f s, %.1f GB/s, %.1f%%\n", dtime, mem/dtime, 100*mem/dtime/maxbw);

    free(a), free(b);
}
The modern solution for inline assembly is to use intrinsics. There are still cases where one needs inline assembly, but this is not one of them. One intrinsics solution for your inline-assembly approach is simply:
void mul_SSE(double* a, double* b) {
    for (int i = 0; i<N/2; i++)
        _mm_store_pd(&a[2*i], _mm_mul_pd(_mm_load_pd(&a[2*i]),_mm_load_pd(&b[2*i])));
}
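If AVX is available, the same idea extends to 256-bit registers. This is only a sketch, assuming the same N macro as above, that a and b are 32-byte aligned (e.g. from _mm_malloc(..., 32)), and that N is a multiple of 4; otherwise the unaligned _mm256_loadu_pd/_mm256_storeu_pd variants would be needed. Compile with -mavx.

#include <immintrin.h>

void mul_AVX(double* a, double* b) {
    // process four doubles per iteration with 256-bit loads/stores
    for (int i = 0; i<N/4; i++)
        _mm256_store_pd(&a[4*i], _mm256_mul_pd(_mm256_load_pd(&a[4*i]), _mm256_load_pd(&b[4*i])));
}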
Let me define some test code
#include <string.h>
#include <stdio.h>
#include <x86intrin.h>
#include <omp.h>

#define N 1000000
#define R 1000

typedef __attribute__(( aligned(32))) double aligned_double;

void (*fp)(aligned_double *a, aligned_double *b);

void mul(aligned_double* __restrict a, aligned_double* __restrict b) {
    for (int i = 0; i<N; i++) a[i] *= b[i];
}

void mul_SSE(double* a, double* b) {
    for (int i = 0; i<N/2; i++)
        _mm_store_pd(&a[2*i], _mm_mul_pd(_mm_load_pd(&a[2*i]),_mm_load_pd(&b[2*i])));
}

void mul_SSE_NT(double* a, double* b) {
    for (int i = 0; i<N/2; i++)
        _mm_stream_pd(&a[2*i], _mm_mul_pd(_mm_load_pd(&a[2*i]),_mm_load_pd(&b[2*i])));
}

void mul_SSE_OMP(double* a, double* b) {
    #pragma omp parallel for
    for (int i = 0; i<N; i++) a[i] *= b[i];
}

void test(aligned_double *a, aligned_double *b, const char *name) {
    double dtime;
    const double mem = 3*sizeof(double)*N*R/1024/1024/1024; // read a, read b, write a
    const double maxbw = 34.1;
    dtime = -omp_get_wtime();
    for(int i=0; i<R; i++) fp(a,b);
    dtime += omp_get_wtime();
    printf("%s \t time %.2f s, %.1f GB/s, efficiency %.1f%%\n", name, dtime, mem/dtime, 100*mem/dtime/maxbw);
}

int main() {
    double *a = (double*)_mm_malloc(sizeof *a * N, 32);
    double *b = (double*)_mm_malloc(sizeof *b * N, 32);

    //b must be initialized to get the correct bandwidth!!!
    memset(a, 1, sizeof *a * N);
    memset(b, 1, sizeof *b * N);

    fp = mul,         test(a,b, "mul        ");
    fp = mul_SSE,     test(a,b, "mul_SSE    ");
    fp = mul_SSE_NT,  test(a,b, "mul_SSE_NT ");
    fp = mul_SSE_OMP, test(a,b, "mul_SSE_OMP");

    _mm_free(a), _mm_free(b);
}
Now the first test
g++ -O2 -fopenmp test.cpp
./a.out
mul time 1.67 s, 13.1 GB/s, efficiency 38.5%
mul_SSE time 1.00 s, 21.9 GB/s, efficiency 64.3%
mul_SSE_NT time 1.05 s, 20.9 GB/s, efficiency 61.4%
mul_SSE_OMP time 0.74 s, 29.7 GB/s, efficiency 87.0%
So with -O2, which does not vectorize loops, we see that the intrinsic SSE version is much faster than the plain C solution mul. Here efficiency = bandwidth_measured/max_bandwidth, where the max is 34.1 GB/s for my system.
Second test
g++ -O3 -fopenmp test.cpp
./a.out
mul time 1.05 s, 20.9 GB/s, efficiency 61.2%
mul_SSE time 0.99 s, 22.3 GB/s, efficiency 65.3%
mul_SSE_NT time 1.01 s, 21.7 GB/s, efficiency 63.7%
mul_SSE_OMP time 0.68 s, 32.5 GB/s, efficiency 95.2%
With -O3, GCC vectorizes the loop and the intrinsic function offers essentially no advantage.
Third test
g++ -O3 -fopenmp -funroll-loops test.cpp
./a.out
mul time 0.85 s, 25.9 GB/s, efficiency 76.1%
mul_SSE time 0.84 s, 26.2 GB/s, efficiency 76.7%
mul_SSE_NT time 1.06 s, 20.8 GB/s, efficiency 61.0%
mul_SSE_OMP time 0.76 s, 29.0 GB/s, efficiency 85.0%
With -funroll-loops, GCC unrolls the loops eight times and we see a significant improvement, except for the non-temporal store solution, and no real advantage for the OpenMP solution.
Before unrolling the loop, the assembly for mul with -O3 is
xor eax, eax
.L2:
movupd xmm0, XMMWORD PTR [rsi+rax]
mulpd xmm0, XMMWORD PTR [rdi+rax]
movaps XMMWORD PTR [rdi+rax], xmm0
add rax, 16
cmp rax, 8000000
jne .L2
rep ret
With -O3 -funroll-loops the assembly for mul is:
xor eax, eax
.L2:
movupd xmm0, XMMWORD PTR [rsi+rax]
movupd xmm1, XMMWORD PTR [rsi+16+rax]
mulpd xmm0, XMMWORD PTR [rdi+rax]
movupd xmm2, XMMWORD PTR [rsi+32+rax]
mulpd xmm1, XMMWORD PTR [rdi+16+rax]
movupd xmm3, XMMWORD PTR [rsi+48+rax]
mulpd xmm2, XMMWORD PTR [rdi+32+rax]
movupd xmm4, XMMWORD PTR [rsi+64+rax]
mulpd xmm3, XMMWORD PTR [rdi+48+rax]
movupd xmm5, XMMWORD PTR [rsi+80+rax]
mulpd xmm4, XMMWORD PTR [rdi+64+rax]
movupd xmm6, XMMWORD PTR [rsi+96+rax]
mulpd xmm5, XMMWORD PTR [rdi+80+rax]
movupd xmm7, XMMWORD PTR [rsi+112+rax]
mulpd xmm6, XMMWORD PTR [rdi+96+rax]
movaps XMMWORD PTR [rdi+rax], xmm0
mulpd xmm7, XMMWORD PTR [rdi+112+rax]
movaps XMMWORD PTR [rdi+16+rax], xmm1
movaps XMMWORD PTR [rdi+32+rax], xmm2
movaps XMMWORD PTR [rdi+48+rax], xmm3
movaps XMMWORD PTR [rdi+64+rax], xmm4
movaps XMMWORD PTR [rdi+80+rax], xmm5
movaps XMMWORD PTR [rdi+96+rax], xmm6
movaps XMMWORD PTR [rdi+112+rax], xmm7
sub rax, -128
cmp rax, 8000000
jne .L2
rep ret
Fourth test
g++ -O3 -fopenmp -mavx test.cpp
./a.out
mul time 0.87 s, 25.3 GB/s, efficiency 74.3%
mul_SSE time 0.88 s, 24.9 GB/s, efficiency 73.0%
mul_SSE_NT time 1.07 s, 20.6 GB/s, efficiency 60.5%
mul_SSE_OMP time 0.76 s, 29.0 GB/s, efficiency 85.2%
Now the non-intrinsic function is the fastest (excluding the OpenMP version).
So there is no reason to use intrinsics or inline assembly in this case because we can get the best performance with appropriate compiler options (e.g. -O3, -funroll-loops, -mavx).
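For reference, a compile line combining those flags for the test code above would look like:

g++ -O3 -funroll-loops -mavx -fopenmp test.cpp
./a.out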
Test system: Ubuntu 16.10, Skylake (i7-6700HQ@2.60GHz), 32 GB RAM. Maximum memory bandwidth: 34.1 GB/s (https://ark.intel.com/products/88967/Intel-Core-i7-6700HQ-Processor-6M-Cache-up-to-3_50-GHz).
Here is another solution worth considering. The cmp instruction is not necessary if we count from -N up to zero and access the arrays as a[N+i]. GCC should have fixed this a long time ago. It eliminates one instruction (though, due to macro-op fusion, the cmp and jmp often count as one micro-op).
void mul_SSE_v2(double* a, double* b) {
    for (ptrdiff_t i = -N; i<0; i+=2)
        _mm_store_pd(&a[N + i], _mm_mul_pd(_mm_load_pd(&a[N + i]),_mm_load_pd(&b[N + i])));
}
Assembly with -O3
mul_SSE_v2(double*, double*):
mov rax, -1000000
.L9:
movapd xmm0, XMMWORD PTR [rdi+8000000+rax*8]
mulpd xmm0, XMMWORD PTR [rsi+8000000+rax*8]
movaps XMMWORD PTR [rdi+8000000+rax*8], xmm0
add rax, 2
jne .L9
rep ret
This optimization will likely only be helpful when the arrays fit in, e.g., the L1 cache, i.e. when not reading from main memory.
I finally found a way to get the plain C solution to not generate the cmp instruction.
void mul_v2(aligned_double* __restrict a, aligned_double* __restrict b) {
    for (int i = -N; i<0; i++) a[i] *= b[i];
}
Then call the function from a separate object file like this: mul_v2(&a[N],&b[N]). This is perhaps the best solution. However, if you call the function from the same object file (translation unit) as the one in which it is defined, GCC generates the cmp instruction again.
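A minimal sketch of that two-file setup (the file names mul_v2.c and main.c and the run wrapper are hypothetical; only the call pattern matters):

// mul_v2.c - defines the function in its own translation unit
#define N 1000000
typedef __attribute__(( aligned(32))) double aligned_double;

void mul_v2(aligned_double* __restrict a, aligned_double* __restrict b) {
    for (int i = -N; i<0; i++) a[i] *= b[i];
}

// main.c - calls it with pointers to one past the end of each array
#define N 1000000
extern void mul_v2(double *a, double *b);

void run(double *a, double *b) {
    mul_v2(&a[N], &b[N]); // negative indices then walk a[0..N-1] and b[0..N-1]
}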
Also,
void mul_v3(aligned_double* __restrict a, aligned_double* __restrict b) {
    for (int i = -N; i<0; i++) a[N+i] *= b[N+i];
}
still generates the cmp instruction and produces the same assembly as the mul function.
The function mul_SSE_NT is silly. It uses non-temporal stores, which are only useful when only writing to memory, but since the function reads and writes the same addresses, non-temporal stores are not only useless, they give inferior results.
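For contrast, here is a sketch of the kind of kernel where non-temporal stores can help: the destination c is only written, never read, so there is no point pulling its cache lines in. The function name and the third array are my own illustration, not part of the benchmark above, and it assumes c is 16-byte aligned and N is even.

void mul_to_NT(double* c, const double* a, const double* b) {
    for (int i = 0; i<N/2; i++)
        _mm_stream_pd(&c[2*i], _mm_mul_pd(_mm_load_pd(&a[2*i]), _mm_load_pd(&b[2*i])));
    _mm_sfence(); // make the streaming stores globally visible before continuing
}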
Previous versions of this answer were getting the wrong bandwidth. The reason was that the arrays were not initialized.
I want to add another point of view on the problem. SIMD instructions give a big performance boost when there is no memory bandwidth restriction. But in the current example there are too many memory load and store operations and too few CPU calculations, so the CPU keeps up with the incoming data even without SIMD. If you use data of another type (32-bit float, for example) or a more complex algorithm, memory throughput will not restrict CPU performance, and using SIMD will give more of an advantage.
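To illustrate that point, here is a sketch of a kernel with much more arithmetic per loaded element; the polynomial and the inner iteration count are made up purely for illustration. A loop like this is compute bound rather than memory bound, so vectorization pays off.

void poly_c(double* a, const double* b, int n) {
    for (int i = 0; i<n; i++) {
        double x = b[i], y = 1.0;
        for (int k = 0; k<32; k++) // 32 extra multiply-adds per element loaded
            y = y*x + 1.0;         // Horner-style evaluation
        a[i] = y;
    }
}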
Your asm code is really OK. What is not OK is the way you measure it. As I pointed out in the comments, you should:
a) use way more iterations - 1 million is nothing for a modern CPU
b) use a high-precision timer (HPT) for the measurement
c) use RDTSC or RDTSCP to count real CPU clock cycles (see the sketch after this list)
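A sketch of point c), using the __rdtscp intrinsic from <x86intrin.h>; the repeat count of 1000 and the bench wrapper are my own choices for illustration.

#include <stdio.h>
#include <x86intrin.h>

extern void mul_c(double* a, double* b); // the function under test

void bench(double* a, double* b) {
    unsigned int aux;
    unsigned long long t0 = __rdtscp(&aux); // rdtscp waits for earlier instructions to finish
    for (int r = 0; r<1000; r++) mul_c(a, b);
    unsigned long long t1 = __rdtscp(&aux);
    printf("%.2f cycles per element\n", (double)(t1 - t0) / (1000.0 * 1000000));
}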
Additionally, why are you afraid of -O3? Don't forget to build the code for your platform, so use -march=native. If your CPU supports AVX or AVX2, the compiler will take the opportunity to produce even better code.
Next, give the compiler some hints about aliasing and alignment if you know your code.
Here is my version of your mul_c - yes, it is GCC-specific, but you showed that you use GCC:
void mul_c(double* restrict a, double* restrict b)
{
    a = __builtin_assume_aligned (a, 16);
    b = __builtin_assume_aligned (b, 16);

    for (int i = 0; i != 1000000; ++i)
    {
        a[i] = a[i] * b[i];
    }
}
It will produce:
mul_c(double*, double*):
xor eax, eax
.L2:
movapd xmm0, XMMWORD PTR [rdi+rax]
mulpd xmm0, XMMWORD PTR [rsi+rax]
movaps XMMWORD PTR [rdi+rax], xmm0
add rax, 16
cmp rax, 8000000
jne .L2
rep ret
If you have AVX2 and make sure the data is 32-byte aligned, it will become
mul_c(double*, double*):
xor eax, eax
.L2:
vmovapd ymm0, YMMWORD PTR [rdi+rax]
vmulpd ymm0, ymm0, YMMWORD PTR [rsi+rax]
vmovapd YMMWORD PTR [rdi+rax], ymm0
add rax, 32
cmp rax, 8000000
jne .L2
vzeroupper
ret
So there is no need for handcrafted asm if the compiler can do it for you ;)