Disclosure: I've asked a similar question on programmers.stack, but that site is nowhere near as active as Stack Overflow.
Intro
I tend to work with lots
For the things you are doing I would look at SIMD (Single Instruction, Multiple Data). Google "GCC compiler intrinsics" for details.
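To give an idea of what SIMD intrinsics look like, here is a minimal sketch that adds two float arrays four elements at a time with SSE. The function name add_floats is made up for illustration, and the unaligned load/store variants are used so the arrays need no special alignment. It assumes an x86 target with SSE enabled (the default on x86-64).

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Hypothetical example: out[i] = a[i] + b[i] for i in [0, n). */
void add_floats(float *out, const float *a, const float *b, size_t n)
{
    assert(out != NULL && a != NULL && b != NULL);
    size_t i = 0;
    /* Main loop: one instruction adds four floats at once. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

With optimizations on, the compiler may auto-vectorize a plain loop like this anyway, so it is worth measuring before hand-writing intrinsics.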
Your value for the peak bandwidth from main memory is off by a factor of two. Instead of 10664 MB/s it should be 21.3 GB/s (more precisely, 21333⅓ MB/s; see my derivation below). The fact that you sometimes measure more than 10664 MB/s should have been a hint that there was a problem in your peak bandwidth calculation.
In order to get the maximum bandwidth for Core2 through Sandy Bridge you need to use non-temporal stores. Additionally, you need multiple threads. You don't need AVX instructions or to unroll the loop.
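// Copies n bytes from y to x using non-temporal (streaming) stores.
// x and y must be 16-byte aligned and n must be a multiple of 16.
void copy(char *x, char *y, int n)
{
    #pragma omp parallel for schedule(static)
    for(int i=0; i<n/16; i++)
    {
        _mm_stream_ps((float*)&x[16*i], _mm_load_ps((float*)&y[16*i]));
    }
}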
The arrays need to be 16-byte aligned and their size must be a multiple of 16 bytes. The rule of thumb for non-temporal stores is to use them when the memory you are copying is larger than half the size of the last-level cache. In your case half the L3 cache size is 1.5 MB and the smallest array you copy is 8 MB, so it is much larger than half the last-level cache size.
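The alignment requirement and the half-cache rule of thumb can be combined into a small dispatcher. This is only a sketch: prefer_streaming and copy_auto are names I made up, the caller has to supply the last-level cache size (llc_bytes) itself, and it is single-threaded to keep it short. It falls back to memcpy when the pointers are not 16-byte aligned or the copy is small, and handles a size that is not a multiple of 16 by copying the tail with memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_stream_si128; also provides _mm_sfence */

/* Hypothetical helper: 1 if the copy is larger than half the last-level cache. */
int prefer_streaming(size_t bytes, size_t llc_bytes)
{
    return bytes > llc_bytes / 2;
}

/* Hypothetical helper: copy n bytes from src to dst, streaming when it looks worthwhile. */
void copy_auto(char *dst, const char *src, size_t n, size_t llc_bytes)
{
    assert(dst != NULL && src != NULL);
    int aligned = ((uintptr_t)dst % 16 == 0) && ((uintptr_t)src % 16 == 0);
    if (!aligned || !prefer_streaming(n, llc_bytes)) {
        memcpy(dst, src, n);   /* small or unaligned: let the library decide */
        return;
    }
    size_t blocks = n / 16;
    for (size_t i = 0; i < blocks; i++) {
        __m128i v = _mm_load_si128((const __m128i *)(src + 16 * i));
        _mm_stream_si128((__m128i *)(dst + 16 * i), v);   /* bypasses the cache */
    }
    _mm_sfence();   /* make the streaming stores globally visible before returning */
    memcpy(dst + 16 * blocks, src + 16 * blocks, n - 16 * blocks);   /* tail bytes */
}
```

The threshold is a rule of thumb, not a guarantee, so treat llc_bytes / 2 as a starting point and measure on your own machine.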
Here is some code to test this.
//gcc -O3 -fopenmp foo.c
#include <stdio.h>
#include <x86intrin.h>
#include <string.h>
#include <omp.h>
void copy(char *x, char *y, int n)
{
#pragma omp parallel for schedule(static)
for(int i=0; i<n/16; i++)
{
_mm_stream_ps((float*)&x[16*i], _mm_load_ps((float*)&y[16*i]));
}
}
void copy2(char *x, char *y, int n)
{
#pragma omp parallel for schedule(static)
for(int i=0; i<n/16; i++)
{
_mm_store_ps((float*)&x[16*i], _mm_load_ps((float*)&y[16*i]));
}
}
int main(void)
{
unsigned n = 0x7fffffff;
char *x = _mm_malloc(n, 16);
char *y = _mm_malloc(n, 16);
double dtime;
memset(x,0,n);
memset(y,1,n);
dtime = -omp_get_wtime();
copy(x,y,n);
dtime += omp_get_wtime();
printf("time non temporal store %f\n", dtime);
dtime = -omp_get_wtime();
copy2(x,y,n);
dtime += omp_get_wtime();
printf("time SSE store %f\n", dtime);
dtime = -omp_get_wtime();
memcpy(x,y,n);
dtime += omp_get_wtime();
printf("time memcpy %f\n", dtime);
_mm_free(x);
_mm_free(y);
return 0;
}
On my system, Core2 (before Nehalem) P9600@2.53GHz, it gives
time non temporal store 0.39
time SSE store 1.10
time memcpy 0.98
to copy 2GB.
Note that it's very important that you "touch" the memory you will write to first (I used memset to do this). Your system does not necessarily allocate your memory until you access it. The overhead of doing this can bias your results significantly if the memory has not been accessed before you do the memory copy.
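A minimal sketch of the "touch before timing" idea, assuming plain malloc rather than _mm_malloc. The helper name alloc_touched is made up. It writes to every byte so the pages are backed by physical memory before any timed region starts.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocate n bytes and write to all of them up front,
   so page allocation does not happen inside the timed copy. */
char *alloc_touched(size_t n, int fill)
{
    char *p = malloc(n);
    if (p != NULL)
        memset(p, fill, n);   /* the write is what forces the pages to be mapped */
    return p;
}
```

I would pass a nonzero fill value: as far as I know some compilers can turn malloc followed by a zeroing memset into calloc, and zeroed pages from calloc may still be allocated lazily.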
According to Wikipedia, DDR3-1333 has a memory clock of 166⅔ MHz. DDR transfers data at twice the memory clock rate. Additionally, DDR3 has a bus clock multiplier of four. So DDR3 has a total multiplier of eight per memory clock. Additionally, your motherboard has two memory channels. So the total transfer rate is
21333⅓ MB/s = (166⅔ * 1E6 clocks/s) * (8 lines/clock/channel) * (2 channels) * (64-bits/line) * (byte/8-bits) * (MB/1E6 bytes).
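The same arithmetic as a small function, in case you want to plug in other memory types. The function and parameter names are mine, and 1 MB is taken as 1E6 bytes as in the derivation above.

```c
#include <assert.h>

/* Theoretical peak bandwidth in MB/s (1 MB = 1E6 bytes).
   mem_clock_mhz:       memory clock in MHz (166 2/3 for DDR3-1333)
   transfers_per_clock: transfers per memory clock (8 for DDR3)
   channels:            number of memory channels
   bus_width_bits:      width of one channel in bits (64) */
double peak_bandwidth_mb_s(double mem_clock_mhz, int transfers_per_clock,
                           int channels, int bus_width_bits)
{
    assert(mem_clock_mhz > 0.0 && transfers_per_clock > 0);
    assert(channels > 0 && bus_width_bits > 0);
    return mem_clock_mhz * transfers_per_clock * channels * (bus_width_bits / 8.0);
}
```

Calling peak_bandwidth_mb_s(500.0 / 3.0, 8, 2, 64) gives about 21333.3 MB/s, and the same call with a single channel gives about 10666.7 MB/s, which is where the factor of two comes from.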
You should compile with a recent GCC (so having compiled your own GCC 5.2 is a good idea, as of November 2015), and you should enable optimizations for your particular platform, so I suggest compiling with gcc -Wall -O2 -march=native at least (also try replacing -O2 with -O3).
(Don't benchmark your programs without enabling optimizations in your compiler.)
If you are concerned with cache effects, you might play with __builtin_prefetch, but see this.
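For reference, this is roughly what using __builtin_prefetch looks like. It is only a sketch: sum_with_prefetch and the PREFETCH_AHEAD distance of 64 elements are made up for illustration, and on a simple sequential scan like this the hardware prefetcher usually does the job already, so do not expect a speedup without measuring.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead to prefetch. */
#define PREFETCH_AHEAD 64

/* Sums an int array while hinting at data that will be needed soon. */
long sum_with_prefetch(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            /* second argument 0 = prefetch for read, third 0 = low temporal locality */
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 0);
        s += a[i];
    }
    return s;
}
```

The prefetch is only a hint: the compiler and CPU are free to ignore it, and it never changes the result of the loop.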
Read also about OpenMP, OpenCL, OpenACC.