I need help with the performance of the following code. It does a memcpy on two dynamically allocated arrays of arbitrary size:
int main()
{
double *a, *b;
u
The first bzero runs longer because of (1) lazy page allocation and (2) lazy zero-initialization of pages by the kernel. While the second cost is unavoidable for security reasons, the cost of lazy page allocation can be reduced by using larger ("huge") pages, since each page fault then maps more memory.
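To see the lazy allocation directly, one option is to count minor page faults around the first and second pass over a buffer. The following is only a sketch: `minor_faults` and `faults_while_touching` are hypothetical helper names, and it assumes `getrusage` reports `ru_minflt` on your system (it does on Linux).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

/* Minor page faults taken by this process so far. */
static long minor_faults(void)
{
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);
    assert(rc == 0);
    (void)rc;
    return ru.ru_minflt;
}

/* Hypothetical helper: write one byte to each page of the buffer and
   return how many minor faults the process took while doing so.
   The volatile pointer keeps the compiler from dropping the writes. */
static long faults_while_touching(void *p, size_t bytes)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    volatile unsigned char *c = p;
    long before = minor_faults();
    for (size_t off = 0; off < bytes; off += (size_t)page)
        c[off] = 0;
    return minor_faults() - before;
}
```

Calling `faults_while_touching` twice on a freshly malloc'ed large buffer should typically show many faults on the first call and close to none on the second, though the exact numbers depend on the allocator, the kernel, and whether huge pages are in use.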
There are at least two ways to use huge pages on Linux. The hard way is hugetlbfs. The easy way is transparent huge pages.
Look for khugepaged in the list of processes on your system. If such a process exists, transparent huge pages are supported, and you can use them in your application by changing malloc to this:
posix_memalign((void **)&b, 2*1024*1024, n*sizeof(double));
madvise((void *)b, n*sizeof(double), MADV_HUGEPAGE);
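Wrapped into a function with error checking, the two calls above might look like the sketch below. `alloc_doubles_thp` and `HUGE_PAGE_BYTES` are hypothetical names, and the 2 MiB huge page size is an assumption (it is the usual default on x86-64, not a universal constant).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Assumed huge page size: 2 MiB, as in the snippet above. */
#define HUGE_PAGE_BYTES (2UL * 1024 * 1024)

/* Hypothetical helper: allocate n doubles aligned to a huge-page boundary
   and ask the kernel to back the range with transparent huge pages.
   Returns NULL if the allocation fails. Release the buffer with free(). */
static double *alloc_doubles_thp(size_t n)
{
    void *p = NULL;

    /* Guard against overflow in n * sizeof(double). */
    assert(n <= SIZE_MAX / sizeof(double));
    size_t bytes = n * sizeof(double);

    /* posix_memalign returns an error number and does not set errno. */
    if (posix_memalign(&p, HUGE_PAGE_BYTES, bytes) != 0)
        return NULL;

#ifdef MADV_HUGEPAGE
    /* Only a hint: if it fails (for example because transparent huge
       pages are disabled), the buffer is still usable with normal pages. */
    if (madvise(p, bytes, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");
#endif
    return p;
}
```

With this in place, `b = alloc_doubles_thp(n);` would stand in for the original malloc call. Whether the kernel actually uses huge pages still depends on the system's transparent huge page settings.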
Surely if you are comparing the speed of initialise and copy to the speed of just copy, then the initialisation should be included in the timed section. It appears to me you should actually be comparing this:
// Version 1
for(i=0; i<n; i++)
    a[i] = 1.0;
tic();
memcpy(b, a, n*sizeof(double));
toc();
To this:
// Version 2
for(i=0; i<n; i++)
    a[i] = 1.0;
tic();
for(i=0; i<n; i++)
    b[i] = 0.0;
memcpy(b, a, n*sizeof(double));
toc();
I expect this will see your 3x speed improvement drop sharply.
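The tic() and toc() helpers are not defined in the snippets above. A minimal sketch, assuming a POSIX system with clock_gettime and a monotonic clock, could look like this (the printing and the return value are my choices, not something the original code specifies):

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Start time recorded by the most recent tic(). */
static struct timespec tic_start;

/* Record the current monotonic time. */
static void tic(void)
{
    int rc = clock_gettime(CLOCK_MONOTONIC, &tic_start);
    assert(rc == 0);
    (void)rc;
}

/* Print and return the seconds elapsed since the last tic(). */
static double toc(void)
{
    struct timespec now;
    int rc = clock_gettime(CLOCK_MONOTONIC, &now);
    assert(rc == 0);
    (void)rc;

    double elapsed = (double)(now.tv_sec - tic_start.tv_sec)
                   + (double)(now.tv_nsec - tic_start.tv_nsec) / 1e9;
    printf("elapsed: %.6f s\n", elapsed);
    return elapsed;
}
```

CLOCK_MONOTONIC is used rather than wall-clock time so that clock adjustments during the run do not distort the measurement.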
EDIT: As suggested by Steve Jessop, you may also want to test a third strategy of only touching one entry per page:
// Version 3
for(i=0; i<n; i++)
    a[i] = 1.0;
tic();
for(i=0; i<n; i+=DOUBLES_PER_PAGE)
    b[i] = 0.0;
memcpy(b, a, n*sizeof(double));
toc();
It's probably lazy page allocation: Linux only maps the pages on first access. IIRC each page in a new block in Linux is a copy-on-write of a blank page, and your allocations are big enough to demand new blocks.
If you want to work around it, you could write one byte or word at 4 KiB intervals, i.e. once per page. That might get the virtual addresses mapped to RAM slightly faster than writing the whole of each page.
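A sketch of that workaround, which queries the page size at run time instead of hard-coding 4 KiB. `prefault_pages` is a hypothetical helper name, and the sketch assumes a POSIX system where sysconf(_SC_PAGESIZE) is available. Dividing the same page size by sizeof(double) gives a per-page stride in doubles, if you prefer to index the array directly.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: write one byte at page-sized intervals across
   [p, p + bytes) so the kernel maps every page before the timed section.
   It overwrites data, so only use it on a buffer whose contents do not
   matter yet. Returns the number of strided writes performed. */
static size_t prefault_pages(void *p, size_t bytes)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    /* volatile keeps the compiler from optimizing the writes away. */
    volatile unsigned char *c = p;
    size_t touched = 0;

    for (size_t off = 0; off < bytes; off += (size_t)page) {
        c[off] = 0;
        touched++;
    }

    /* If p is not page-aligned, the last page may have been skipped,
       so touch the final byte as well. */
    if (bytes > 0)
        c[bytes - 1] = 0;

    return touched;
}
```

Calling `prefault_pages(b, n*sizeof(double));` before tic() moves the page-fault cost out of the timed region. It does not make that cost go away.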
I wouldn't expect (most efficient workaround to force the lazy memory mapping to happen) plus (copy) to be significantly faster than just (copy) without the initialization of b, though. So unless there's a specific reason why you care about the performance just of the copy, not of the whole operation, then it's fairly futile. It's "pay now or pay later", Linux pays later, and you're only measuring the time for later.