Problem (think of the mark phase of a GC)
Use 2M hugepages for address ranges that are full of "hot" data that the kernel can't usefully swap out any / many 4k chunks of. This will reduce TLB misses, but costs extra physical memory if there are any 4k chunks of a hugepage that aren't hot.
Linux does this transparently for anonymous pages (https://www.kernel.org/doc/Documentation/vm/transhuge.txt), but you can use madvise(MADV_HUGEPAGE) on pages you know are worth it, to encourage the kernel to defrag physical memory even if that's not the default in /sys/kernel/mm/transparent_hugepage/defrag
. (You can look at /proc/PID/smaps
to see how many transparent hugepages are in use for any given mapping.)
Based on what you posted in your answer: An ordered set of nodesToVisit
would give you the most locality, but might be too expensive to maintain. Multiple accesses within the same 64-byte cache line are much cheaper than coming back to it later after it's been evicted from L3 cache and has to come from DRAM again.
If you have lots of addresses to visit in your Set, doing one pass of a radix-sort into 2M buckets would give you locality within one hugepage. 2M is also smaller than L3 cache size, so you'll probably get some cache hits when visiting multiple objects in the same cache line, even if you don't hit them back to back.
Depending on how big your Set is, throwing around that many pointers even to partial-sort them might not be worth the memory traffic that takes. But there's probably some sweet spot of taking a window of data and at least partially sorting it. Using the pointers before they are evicted from cache is nice.
SW prefetch can trigger a page-walk to avoid a TLB miss, so you could _mm_prefetch(_MM_HINT_T2) one address from the next 2M bucket before starting on the current bucket. See also Prefetching Examples?. I haven't tested this, but it might work well. It won't help with page faults: prefetch from an unmapped page won't cause a page fault, and you don't want to trigger an actual PF until you're ready to touch the page.
MSalter's suggestion to ask the OS to prefetch and wire the next page is interesting (I think madvise(MADV_WILLNEED)
is the Linux equivalent), but a system call will be slow for no benefit if the page was already mapped+wired into the HW page table. There's no x86 asm instruction that just asks if a page is mapped without faulting if it isn't, so I can't think of a way to efficiently choose not to call it. And BTW, I think Linux breaks up transparent hugepages into 4k regular pages for paging in/out. But don't write a big loop that just does _mm_prefetch()
or madvise
on all the 4k pages in a 2M block; that probably sucks. The prefetcht2
part would probably just result in excess prefetch requests being dropped.
Use perf counters to look at cache hit/miss rates. On Intel CPUs, the mem_load_retired.l1_miss
and/or .l2_miss
event should show you whether you're getting cache hits on accessing the Set itself, as well as on accessing dereferencing those pointers. Those counters are precise events, so they should map accurately to asm load instructions. (e.g. perf record -e mem_load_retired.l2_miss ./my_program
/ perf report
on Linux).
We remove one item at a time from
nodesToVisit
I don't know much about GC design, but can't you use a sequence number or tagged-pointer or something to avoid modifying the Set data structure itself every GC pass? If your minimum object alignment is 4 bytes, you have 2 bits to play with at the bottom of every pointer. ANDing them off before dereferencing is very cheap.
x86-64 with full 64-bit pointers currently requires the top 16 to be the sign-extension of the low 48. So you could use bits there (16 bits, or maybe just the top byte) if you re-canonicalize pointers. (redo sign extension, or just zero the high 16 bits if you want to assume user-space pointers; Linux uses a high-half kernel VM layout so user-space addresses are always in the low half of virtual address space. IDK what Windows does.)
On x86-64, you might consider using the x32 ABI (32-bit pointers in long mode) if 4GiB of address space is enough, especially if you're hitting physical memory limits and swapping. Smaller pointers mean smaller data structures, thus half the cache footprint.
Some Linux systems are built without kernel support for x32, though, only classic x86-64 and usually 32-bit mode. But if it works on your systems, consider gcc -mx32
.