I have the following code that writes zeros to a global array twice, once forward and once backward.

#include <string.h>
#include <time.h>
#include <stdio.h>

#define SIZE 100000000
char c[SIZE];
char c2[SIZE];

int main()
{
    int i;
    clock_t t = clock();
    for (i = 0; i < SIZE; i++)
        c[i] = 0;
    t = clock() - t;
    printf("%ld\n\n", (long)t);

    t = clock();
    for (i = SIZE - 1; i >= 0; i--)
        c[i] = 0;
    t = clock() - t;
    printf("%ld\n\n", (long)t);
}
Following asimes' answer attributing this to caching: I'm not convinced you can enjoy the benefit of caches with a ~100M array; you're likely to completely thrash out any useful data before returning to it.
However, depending on your platform (mostly the OS), there are other mechanisms at work. When you allocate the arrays you never initialize them, so the first loop probably incurs the penalty of the first access to each 4k page. This usually triggers an assist or a syscall that comes with high overhead.
In this case you also modify the page, so most systems would be forced to perform a copy-on-write flow (an optimization that works only as long as you just read from a page); this is even heavier.
Adding a small access per page (which should be negligible with regard to actual caching, since it only fetches one 64B line out of each 4k page) managed to make the results more even on my system (although this form of measurement isn't very accurate to begin with):
#include <string.h>
#include <time.h>
#include <stdio.h>

#define SIZE 100000000
char c[SIZE];
char c2[SIZE];

int main()
{
    int i;
    for (i = 0; i < SIZE; i += 4096) //// access and modify each page once
        c[i] = 0;                    ////

    clock_t t = clock();
    for (i = 0; i < SIZE; i++)
        c[i] = 0;
    t = clock() - t;
    printf("%ld\n\n", (long)t);

    t = clock();
    for (i = SIZE - 1; i >= 0; i--)
        c[i] = 0;
    t = clock() - t;
    printf("%ld\n\n", (long)t);
}
If you modify the second loop to be identical to the first, the effect is the same: the second loop is faster.
int main()
{
    int i;
    clock_t t = clock();
    for (i = 0; i < SIZE; i++)
        c[i] = 0;
    t = clock() - t;
    printf("%ld\n\n", (long)t);

    t = clock();
    for (i = 0; i < SIZE; i++)
        c[i] = 0;
    t = clock() - t;
    printf("%ld\n\n", (long)t);
}
This is due to the first loop loading the data into the cache, leaving it readily accessible during the second loop.
Results of the above:
317841
277270
Edit: Leeor brings up a good point: c does not fit in the cache. I have an Intel Core i7 processor: http://ark.intel.com/products/37147/Intel-Core-i7-920-Processor-8M-Cache-2_66-GHz-4_80-GTs-Intel-QPI
According to the link, the L3 cache is only 8 MB, or 8,388,608 bytes, while c is 100,000,000 bytes.
When you define global data in C, it is zero-initialized:
char c[SIZE];
char c2[SIZE];
In the Linux (Unix) world this means that both c and c2 will be placed in a special ELF file section, the .bss:

... data segment containing statically-allocated variables represented solely by zero-valued bits initially

The .bss segment exists so that all those zeroes need not be stored in the binary; it just says something like "this program wants to have 200 MB of zeroed memory".
When your program is loaded, the ELF loader (the kernel in the case of classic static binaries, or the ld.so dynamic loader, also known as the interp) will allocate the memory for .bss, usually with something like mmap with the MAP_ANONYMOUS flag and a READ+WRITE permissions/protection request.
But the memory manager in the OS kernel will not give you all 200 MB of zeroed memory up front. Instead it will mark part of your process's virtual memory as zero-initialized, and every page of that memory will point to a special zero page in physical memory. This page holds 4096 zero bytes, so if you read from c or c2 you will get zero bytes; this mechanism allows the kernel to cut down memory requirements.
The mappings to the zero page are special; they are marked (in the page table) as read-only. When you first write to any such virtual page, a pagefault exception is generated by the hardware (by the MMU and TLB). This fault is handled by the kernel, in your case by the minor-pagefault handler: it allocates one physical page, fills it with zero bytes, points the mapping of the just-accessed virtual page at that physical page, and then reruns the faulting instruction.
I converted your code a bit (both loops were moved into separate functions):
$ cat b.c
#include <string.h>
#include <time.h>
#include <stdio.h>

#define SIZE 100000000
char c[SIZE];
char c2[SIZE];

void FIRST()
{
    int i;
    for (i = 0; i < SIZE; i++)
        c[i] = 0;
}

void SECOND()
{
    int i;
    for (i = 0; i < SIZE; i++)
        c[i] = 0;
}

int main()
{
    clock_t t = clock();
    FIRST();
    t = clock() - t;
    printf("%ld\n\n", (long)t);

    t = clock();
    SECOND();
    t = clock() - t;
    printf("%ld\n\n", (long)t);
}
Compile with gcc b.c -fno-inline -O2 -o b, then run under Linux's perf stat or the more generic /usr/bin/time to get the pagefault count:
$ perf stat ./b
139599
93283
Performance counter stats for './b':
....
24 550 page-faults # 0,100 M/sec
$ /usr/bin/time ./b
234246
92754
Command exited with non-zero status 7
0.18user 0.15system 0:00.34elapsed 99%CPU (0avgtext+0avgdata 98136maxresident)k
0inputs+8outputs (0major+24576minor)pagefaults 0swaps
So, we have 24.5 thousand minor pagefaults. With the standard x86/x86_64 page size of 4096 bytes, that covers 24,576 × 4096 ≈ 100 megabytes.
With the perf record / perf report Linux profiler we can find where the pagefaults occur (are generated):
$ perf record -e page-faults ./b
...skip some spam from non-root run of perf...
213322
97841
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.018 MB perf.data (~801 samples) ]
$ perf report -n |cat
...
# Samples: 467 of event 'page-faults'
# Event count (approx.): 24583
#
# Overhead Samples Command Shared Object Symbol
# ........ ............ ....... ................. .......................
#
98.73% 459 b b [.] FIRST
0.81% 1 b libc-2.19.so [.] __new_exitfn
0.35% 1 b ld-2.19.so [.] _dl_map_object_deps
0.07% 1 b ld-2.19.so [.] brk
....
So now we can see that only the FIRST function generates pagefaults (on the first write to the bss pages), and SECOND generates none. Every pagefault corresponds to some work done by the OS kernel, and that work is done only once per page of bss (because the bss is not unmapped and remapped back).