I have always had the idea that reducing the number of iterations is the way to make programs more efficient. Since I never really confirmed that, I set out to test it.
I believe it's more complex than that. Whether a single loop is faster than multiple loops depends on a few factors.
The very fact that a program iterates over a set of data costs you something (incrementing the iterator or index, and comparing the iterator/index against some value that tells you the loop is finished), so if you divide a loop into a couple of smaller loops, you pay more for iterating over the same set of data multiple times.
On the other hand, if the loop is smaller, the optimizer has an easier job and more ways to optimize the code. The CPU also has mechanisms to make loops run faster, and they usually work best with small loops.
I've had pieces of code which became faster after dividing a loop into smaller ones. I also wrote algorithms which turned out to perform better when I merged a couple of loops into one loop.
Generally there are many factors, and it's difficult to predict which one dominates, so the answer is: you should always measure and check a few versions of the code to find out which one is faster.
There are four important things here:
1) Benchmarking without optimization is meaningless. It turns out that there's a real effect under this which doesn't go away with optimization. In fact, an anti-optimized debug build was hiding a lot of the difference under the extra cost of storing loop counters in memory (limiting loops to 1 per 6 clocks vs. 1 per clock), plus not auto-vectorizing the store loops.
If you didn't already know the asm + CPU microarchitectural details of why there's a speed diff, it wasn't safe or useful to measure it with optimization disabled.
2) Cache conflict misses (if the arrays are all aligned the same relative to a page boundary). Skewing the arrays relative to each other could help a lot. This can happen naturally depending on how they're allocated, even if their sizes aren't large powers of 2.
The arrays are all large and were allocated separately with `new`, so they're probably all page-aligned (or offset by 16B from a page boundary in implementations that put some info, like a size, before the object). On Linux, glibc malloc/new typically handles large allocations by allocating fresh pages from the OS with `mmap()` (and using the first 16 bytes for bookkeeping for that block), rather than moving the `brk()`.
4k aliasing means they all go to the same set in a typical L1d cache, which is 8-way associative on typical x86 CPUs. Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? explains why it's not a coincidence that 64 sets * 64B/line = 4096B page-size (times 8-way = 32kiB), because that makes VIPT L1d cache work like a PIPT without homonym/synonym problems. See also Which cache mapping technique is used in intel core i7 processor?
The 9th store will evict the cache line from the 1st store, so lines will be evicted once per store, not fully written like in the contiguous case. (Unless the compiler auto-vectorizes and does a whole cache-line full of stores to one array before moving on.) x86's strongly-ordered memory model requires committing stores from the store buffer to L1d in program order, so it can't merge non-adjacent stores to the same line into one entry before commit, or commit multiple outstanding stores when a line does come in if they're not consecutive.
(The replacement policy is pseudo-LRU, not true LRU, so you might sometimes find that a line is still hot after 8 or 9 evictions in the same set.)
Reminder: the above only applies if all the arrays have the same alignment relative to a page. Over-allocating and doing `ptr = 128 + malloc(128 + size)` for one of the pointers can skew it relative to the others, and this is sometimes worth doing.
You say you have a PC, so I'm guessing an Intel CPU. (Ryzen's L1d has the same geometry, but Bulldozer-family doesn't.)
(Intel's optimization manual section 3.6.10, Write Combining, recommends loop fission for loops that write more than 4 output streams. This advice is in a section about NT stores and WC memory; it may only be intended to apply to that case. Either way, 4 isn't the right number for modern Intel, unless you're being conservative to account for the other hyperthread.)
(Intel's) Assembly/Compiler Coding Rule 58. (H impact, L generality) If an inner loop writes to more than four arrays (four distinct cache lines), apply loop fission to break up the body of the loop such that only four arrays are being written to in each iteration of each of the resulting loops.
TL:DR: for NT stores (cache bypassing), up to 12 output streams seems ok on Skylake and newer, or 10 on Broadwell/Haswell and older. (Or fewer if you're reading any memory at the same time). That's the number of LFBs (Line Fill Buffers) on those CPUs. Earlier CPUs (before Nehalem) had fewer than 10, and maybe couldn't use all of them for NT stores. (Where is the Write-Combining Buffer located? x86) LFBs are used for all transfers of lines to/from L1d, so e.g. a pending load miss needs an LFB allocated to be waiting for that line from L2.
(With hyperthreading, keep in mind that the other hyperthread is competing for LFBs on the same physical core, so don't depend on using all 12 LFBs unless you can disable HT.)
But you're not doing NT stores.
The conventional wisdom was that this 4-output efficiency limit applied to normal (non-NT) stores to WB memory as well, but that is not the case on modern Intel. It was a coincidence that performance for normal (WB = write-back) stores fell off at about the same number of output streams as for NT stores. That mechanical-sympathy article takes some guesses at the reason, but they don't sound right.
See https://github.com/Kobzol/hardware-effects/issues/1 for some microbenchmarks. (And see discussion between me, BeeOnRope, and Hadi Brais about LFBs where this 4-output guideline came up: https://chat.stackoverflow.com/transcript/message/45474939#45474939 , previously in comments under Size of store buffers on Intel hardware? What exactly is a store buffer?)
@BeeOnRope also posted a bar graph for regular (non-NT) stores interleaved to 1 to 15 output streams on Skylake. Performance is somewhat constant for any number of streams up to about 6 on Skylake, then it starts to get worse at 7 and 8 (maybe from L1d conflict misses if the arrays were all aligned the same way), and more significantly from 9 and up until getting close to a plateau at 13 to 15. (At about 1/3rd the performance of the 1 to 6 stream good case).
Again, with Hyperthreading, the other logical core will almost certainly be generating some memory traffic if it's running at all, so a conservative limit like 4 output streams is not a bad plan. But performance doesn't fall off a cliff at 7 or 8, so don't necessarily fission your loops if that costs more total work.
See also Enhanced REP MOVSB for memcpy for more about regular RFO stores vs. no-RFO NT stores, and lots of x86 memory bandwidth issues. (Especially that memory/L3 cache latency limits single-core bandwidth on most CPUs, but it's worse on many-core Xeons: they surprisingly have lower single-core memory bandwidth than a quad-core desktop. With enough cores busy, you can saturate their high aggregate bandwidth from quad or 6 channel memory controllers; that's the situation they're optimized for.)
2.5) DRAM page locality: write-back to memory happens when data is eventually evicted from L3 (Last level cache). The dirty cache lines get sent to the memory controller which can buffer and batch them up into groups, but there will still be a mix of stores (and RFO loads) to all 10 arrays. A dual channel memory controller can't have 10 DRAM pages open at once. (I think only 1 per channel, but I'm not an expert on DRAM timings. See Ulrich Drepper's What Every Programmer Should Know About Memory which does have some details.) https://pubweb.eng.utah.edu/~cs6810/pres/12-6810-15c.pdf mentions DRAM open/closed page policies for streaming vs. scattered stores.
The bottom line here is that even if cache could handle many output streams, DRAM is probably happier with fewer. Note that a DRAM "page" is not the same size as a virtual memory page (4k) or hugepage (2M).
Speaking of virtual memory, the TLB should be fine with 10 output streams: modern x86 CPUs have many more than 10 L1dTLB entries. Hopefully they're associative enough, or the entries don't all alias, so we don't get a TLB-miss on every store!
3) Auto-vectorization (@RichardHodges spotted this one)
Your big combined loop doesn't auto-vectorize with gcc or clang. They can't prove that `list1[10]` isn't also `list4[9]` or something, so they can't store `list1[8..11]` with a single 16-byte store.
But the single-array loops can easily auto-vectorize with SSE or AVX. (Surprisingly not to a `wmemset` call or something, just with the regular auto-vectorizer, only at `gcc -O3` or `clang -O2`. That might switch to NT stores for large sizes, which would help most if multiple cores are competing for memory bandwidth. memset pattern-recognition is / would be useful even without auto-vectorization.)
The only alias analysis required here is to prove that `list1[i] = 2` doesn't modify the `list1` pointer value itself (because the function reads the global inside the loop, instead of copying the value to a local). Type-based aliasing analysis (`-fstrict-aliasing` is on by default) allows the compiler to prove that, and/or the fact that if `list1` were pointing to itself, there would be undefined behaviour from accessing outside the object in later loop iterations.
Smart compilers can and do check for overlap before auto-vectorizing in some cases (e.g. of output arrays against input arrays) when you fail to use the `__restrict` keyword (borrowed by several compilers from C's restrict). If there is overlap, they fall back to a safe scalar loop.
But that doesn't happen in this case: gcc and clang don't generate a vectorized loop at all, they just do scalar in `myFunc1`. If each store causes a conflict miss in L1d, this makes it 4x worse than if you'd given the compiler enough info to do its job. (Or 8x with AVX for 32-byte stores.) Normally the difference between 16B and 32B stores is minor when main-memory bandwidth is the bottleneck (not L1d cache), but here it could be a big deal, because 10 output streams break the write-combining effect of L1d if they all alias.
BTW, making the global variables `static int *__restrict list1` and so on does allow gcc to auto-vectorize the stores in `myFunc1`. It doesn't fission the loop, though. (It would be allowed to, but I guess it's not looking for that optimization. It's up to the programmer to do that.)
// global modifier allows auto-vec of myFunc1
#define GLOBAL_MODIFIER __restrict
#define LOCAL_MODIFIER __restrict // inside myFunc1
static int *GLOBAL_MODIFIER list1, *GLOBAL_MODIFIER list2,
*GLOBAL_MODIFIER list3, *GLOBAL_MODIFIER list4,
*GLOBAL_MODIFIER list5, *GLOBAL_MODIFIER list6,
*GLOBAL_MODIFIER list7, *GLOBAL_MODIFIER list8,
*GLOBAL_MODIFIER list9, *GLOBAL_MODIFIER list10;
I put your code on the Godbolt compiler explorer with gcc8.1 and clang6.0, with that change plus a function that reads from one of the arrays to stop them from optimizing away entirely (which they otherwise would, because I made them `static`).
Then we get this inner loop which should probably run 4x faster than the scalar loop doing the same thing.
.L12: # myFunc1 inner loop from gcc8.1 -O3 with __restrict pointers
movups XMMWORD PTR [rbp+0+rax], xmm9 # MEM[base: l1_16, index: ivtmp.87_52, offset: 0B], tmp108
movups XMMWORD PTR [rbx+rax], xmm8 # MEM[base: l2_17, index: ivtmp.87_52, offset: 0B], tmp109
movups XMMWORD PTR [r11+rax], xmm7 # MEM[base: l3_18, index: ivtmp.87_52, offset: 0B], tmp110
movups XMMWORD PTR [r10+rax], xmm6 # MEM[base: l4_19, index: ivtmp.87_52, offset: 0B], tmp111
movups XMMWORD PTR [r9+rax], xmm5 # MEM[base: l5_20, index: ivtmp.87_52, offset: 0B], tmp112
movups XMMWORD PTR [r8+rax], xmm4 # MEM[base: l6_21, index: ivtmp.87_52, offset: 0B], tmp113
movups XMMWORD PTR [rdi+rax], xmm3 # MEM[base: l7_22, index: ivtmp.87_52, offset: 0B], tmp114
movups XMMWORD PTR [rsi+rax], xmm2 # MEM[base: l8_23, index: ivtmp.87_52, offset: 0B], tmp115
movups XMMWORD PTR [rcx+rax], xmm1 # MEM[base: l9_24, index: ivtmp.87_52, offset: 0B], tmp116
movups XMMWORD PTR [rdx+rax], xmm0 # MEM[base: l10_25, index: ivtmp.87_52, offset: 0B], tmp117
add rax, 16 # ivtmp.87,
cmp rax, 40000000 # ivtmp.87,
jne .L12 #,
(This is compiling for x86-64, of course. x86 32-bit doesn't have enough registers to keep all the pointers in regs, so you'd have a few loads. But those would hit in L1d cache and not actually be much of a throughput bottleneck: at a 1 store per clock bottleneck, there's plenty of throughput to get some more work done in this case where you're just storing constants.)
This optimization is like unrolling the loop 4x and then re-arranging to group 4 stores to each array together. This is why it can't be done if the compiler doesn't know about them being non-overlapping. clang doesn't do it even with `__restrict`, unfortunately. The normal use of `__restrict` to promise non-overlapping is on function args, not locals or globals, but I didn't try that.
With global arrays instead of global pointers, the compiler would know they didn't overlap (and there wouldn't be a pointer value stored in memory anywhere; the array addresses would be link-time constants.) In your version, the arrays themselves have dynamic storage and it's only the pointers to them that have static storage.
What if myFunc1 stored 64 bytes to one array before moving on to the next? Then your compiler could safely compile it to 4 (SSE), 2 (AVX), or 1 (AVX512) vector stores per array per iteration, covering a full 64 bytes.
If you aligned your pointers by 64 (or if the compiler did some alias analysis and got to the first 64-byte boundary in each output array), then each block of stores would fully write a cache line, and we wouldn't touch it again later.
That would avoid L1d conflict-misses, right? Well maybe, but unless you use NT stores to avoid RFOs, the HW prefetchers need to be pulling lines into L2 and then into L1d before the stores try to commit. So it's not as simple as you might think, but the write-combining buffers that combine stores to cache lines that haven't arrived yet can help.
The L2 streamer prefetcher in Intel CPUs can track 1 forward and 1 backward access per page, so it should be ok (if the arrays don't alias in L2). It's the L1d prefetching that's the big problem.
It would still greatly reduce the amount of cache lines bouncing to/from L2. If you ever have a loop that can't fission easily into multiple loops, at least unroll it so you can write a full cache line before moving on.
AVX512 might make a difference; IDK if an aligned `vmovdqa64 [mem], zmm0` on Skylake-AVX512 can maybe skip loading the old value when getting the cache line into MESI Modified state, because it knows it's overwriting the entire cache line. (If done without merge-masking.)
gcc8.1 doesn't bother to align output pointers even with AVX512; a possibly-overlapping first and last vector would probably be a good strategy for easy cases like this where writing the same memory twice isn't a problem. (Alignment makes more difference for AVX512 than for AVX2 on Skylake hardware.)
4) Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake shows that interleaving dummy writes (to the same location) with a stream of stores can make it worse than 1 contiguous stream, for L1d / L2 bandwidth.
Possibly because of store-merging / coalescing happening in the store buffer before commit to L1d cache. But only for adjacent stores to the same cache line (because x86's strongly-ordered memory model can't allow stores to commit to L1d out of order).
That test doesn't suffer from the cache-conflict problems. But writing a whole cache line contiguously should help some there, too.
When trying to benchmark code, you need to:

1) Compile with optimization flags enabled.

2) Run the code multiple times and take the average.

You didn't do both. You could use `-O3` for example, and as for the average, I did this (I made the function return an element from a list):
for(int i = 0; i < 100; ++i)
dummy = myFunc1();
Then, I got an output like this:
Time taken by func1 (micro s):206693
Time taken by func2 (micro s):37898
That confirms what you saw: the single-loop version is several times slower here, which is a very big deal.
In a single for-loop, you do the housekeeping once and the loop counter is incremented once per iteration. In several for-loops, this is duplicated (you pay it as many times as you have for-loops). When the body of the loop is trivial, like in your case, it can make a difference.
Another issue is data locality. The second function has loops that populate one list at a time, so memory is accessed in a contiguous fashion. In the big loop of your first function, you fill one element of each list at a time, which boils down to scattered access to memory: when `list1`, for example, is brought into the cache because you filled an element of it, the next line of your code requests `list2`, making the cached part of `list1` useless. In the second function, once you bring `list1` into the cache, you keep using it from the cache (rather than fetching it from memory), which results in a major speedup.

I believe this fact dominates over the other (big loop vs. several small ones) here. So you are not actually benchmarking what you wanted to, but rather random memory access vs. contiguous memory access.
This code creates the variables:
list1 = new int[n]; list2 = new int[n];
list3 = new int[n]; list4 = new int[n];
list5 = new int[n]; list6 = new int[n];
list7 = new int[n]; list8 = new int[n];
list9 = new int[n]; list10 = new int[n];
but it almost certainly does not create the actual physical page mappings until the memory is actually modified. See Does malloc lazily create the backing pages for an allocation on Linux (and other platforms)? for an example.
So your `func1()` has to wait for the creation of the actual physical pages of RAM, whereas your `func2()` doesn't. Change the order, and the mapping time will be attributed to `func2()` instead.

The easiest solution, given your code as posted, is to run either `func1()` or `func2()` once before doing your timed runs.
If you don't ensure that the actual physical memory has been mapped before you do any benchmarking, that mapping will be part of the time you measure when you first modify the memory.
Your assumptions are basically flawed:
Loop iteration does not incur significant cost.
This is what CPUs are optimized for: tight loops. CPU optimizations can go as far as using dedicated circuitry for the loop counter (the PPC `bdnz` instruction, for example) so that loop-counter overhead is exactly zero. x86 does need a CPU cycle or two AFAIK, but that's it.
What kills your performance is generally memory accesses.
Fetching a value from L1 cache already costs three to four CPU cycles of latency. A single load from L1 cache has more latency than your loop control! It's more for higher-level caches, and RAM access takes forever.
So, to get good performance, you generally need to reduce the time spent on accessing memory. That can be done either by
Avoiding memory accesses.
Most effective, and most easily forgotten optimization. You do not pay for what you don't do.
Parallelizing memory accesses.
Avoid loads where the address of the next needed value depends on a previously loaded value. This optimization is tricky, as it needs a clear understanding of the dependencies between the different memory accesses.
This optimization may require some loop fusion or loop unrolling to exploit the independence between the different loop bodies/iterations. In your case, the loop iterations are independent from each other, so they are already as parallel as can be.
Also, as MSalters rightly points out in the comments: the CPU has a limited number of registers. How many depends on the architecture; a 32-bit x86 CPU has only eight, for instance. Thus it simply cannot handle ten different pointers at the same time. It will need to store some of the pointers on the stack, introducing even more memory accesses, which obviously violates the point above about avoiding memory accesses.
Sequentialize memory accesses.
CPUs are built with the knowledge that the vast majority of memory accesses is sequential, and they are optimized for this. When you start to access an array, the CPU will generally notice pretty quickly, and start prefetching the subsequent values.
The last point is where your first function fails: you are jumping back and forth between 10 different arrays at 10 totally different memory locations. This reduces the CPU's ability to deduce which cache lines it should prefetch from main memory, and thus reduces overall performance.
If I had to hazard a guess I would say that what you are seeing is the result of more frequent memory cache misses in the first function.
`myFunc1()` is essentially performing 10e8 memory writes in a random-access manner.

`myFunc2()` is performing 10x sequential memory writes of 10e7 words.
On a modern memory architecture I'd expect the second to be more efficient.