Let us assume that I have two pointers that are pointing to unrelated addresses that are not cached, so they will both have to come all the way from main memory when being d
Modern CPUs have multiple load buffers, so multiple loads can be outstanding at the same time. The memory subsystem is heavily pipelined, which gives many parts of it much better throughput than latency. For example, with prefetching, Haswell can sustain an 8-byte load from main memory every clock cycle, but the latency when the address isn't known ahead of time is in the hundreds of cycles.
So yes, a Haswell core can keep track of up to 72 outstanding load uops waiting for data from cache / memory. (This is per-core. The shared L3 cache also needs some buffers to handle the whole system's loads / stores to DRAM and memory-mapped IO.)
Haswell's ReOrder Buffer size is 192 uops, so up to 190 uops of work in the code that does not use `a` or `b` can be issued and executed while the loads of `a` and `b` are the oldest instructions that haven't retired. Instructions / uops are retired in-order to support precise exceptions. The ROB size is basically the limit of the out-of-order window for hiding latency of slow operations like cache-misses.
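To make the distinction concrete, here is a minimal C sketch of the two cases. The function and struct names (`sum_independent`, `second_value`, `struct node`) are made up for illustration and are not from the question. In the first function the addresses of both loads are known up front, so an out-of-order core is free to have both cache misses in flight at once and to run the unrelated loop while waiting. In the second, the address of the second load comes from the result of the first, so the two miss latencies add up instead of overlapping. Whether any overlap actually happens depends on the compiler's output and the hardware; the source only shows the dependency structure.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: two independent loads plus unrelated work.
 * Neither load's address depends on the other load's result, so
 * both misses can be outstanding at the same time. */
int64_t sum_independent(const int64_t *a, const int64_t *b,
                        const int64_t *other, size_t n)
{
    int64_t va = *a;            /* load 1: may miss in cache */
    int64_t vb = *b;            /* load 2: independent of load 1 */

    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += other[i];        /* work that does not use va or vb */

    return va + vb + acc;       /* first use of the loaded values */
}

/* Hypothetical contrast: pointer chasing. The second load needs the
 * result of the first to form its address, so it cannot start until
 * the first load's data arrives. */
struct node {
    struct node *next;
    int64_t value;
};

int64_t second_value(const struct node *p)
{
    const struct node *q = p->next;  /* load 1 */
    return q->value;                 /* load 2: address depends on load 1 */
}
```

The first shape is the one described in the question: two unrelated pointers, with other work available that touches neither result.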
Also see other links at the x86 tag wiki to learn how CPUs work. Agner Fog's microarch guide is great for having a mental model of the CPU pipeline to let you understand approximately how code will execute.
From David Kanter's Haswell writeup: