I have read the Wikipedia page about out-of-order execution and speculative execution.
What I fail to understand though are the similarities and differences. It se
Speculative execution and out-of-order execution are orthogonal. One could design a processor that is OoO but not speculative, or speculative but in-order. OoO execution is an execution model in which instructions can be executed in an order that is potentially different from the program order. However, the instructions are still retired in program order so that the program's observed behavior is the same as the one intuitively expected by the programmer. (Although it's possible to design an OoO processor that retires instructions in some unnatural order with certain constraints. See the simulation-based study on this idea: Maximizing Limited Resources: a Limit-Based Study and Taxonomy of Out-of-Order Commit).
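To make the "execute out of order, retire in order" idea concrete, here is a minimal toy sketch. It is not a model of any real pipeline: the `Uop` class, the latencies, and the assumption that the three instructions are independent and all issue at cycle 0 are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Uop:
    name: str
    latency: int      # hypothetical cycles until the result is ready
    done_at: int = 0


def simulate(program):
    """Issue every uop at cycle 0 (no dependencies), retire in program order."""
    for uop in program:
        uop.done_at = uop.latency

    # Execution completes in order of readiness, not in program order.
    completion_order = [u.name for u in sorted(program, key=lambda u: u.done_at)]

    # Retirement walks the list in program order: a uop can only retire once
    # it has completed AND everything older than it has already retired.
    retire_order = []
    cycle = 0
    for uop in program:
        cycle = max(cycle, uop.done_at)
        retire_order.append((uop.name, cycle))
    return completion_order, retire_order


program = [Uop("load", 5), Uop("add", 1), Uop("mul", 3)]
completion, retirement = simulate(program)
print(completion)   # ['add', 'mul', 'load']
print(retirement)   # [('load', 5), ('add', 5), ('mul', 5)]
```

The add and mul finish early, but they cannot retire until the slow load ahead of them has retired, so the architecturally visible order is still the program order.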
Speculative execution, broadly speaking, is an execution model in which instructions can be fetched and enter the pipeline and even begin execution without even knowing for sure that they will indeed be required to execute (according to the control flow of the program). The term is also often used to specifically refer to speculative execution in the execution stage of the pipeline. The Meltdown paper does define these terms on page 3:
In this paper, we refer to speculative execution in a more restricted meaning, where it refers to an instruction sequence following a branch, and use the term out-of-order execution to refer to any way of getting an operation executed before the processor has committed the results of all prior instructions.
Note that instructions can be executed speculatively, yet in-order. When the decoding stage of the pipeline identifies a conditional branch instruction, it can speculate on the branch and its target and fetch instructions from the predicted target location. But still, instructions can also be executed in-order. However, note that once the speculated conditional branch instruction and the instructions fetched from the predicted path (or both paths) reach the issue stage, none of them will be issued until all earlier instructions retire. When that happens, the processor would know whether the prediction was correct and flush the pipeline otherwise.
Processors designed to carry out simple tasks and used in embedded systems or IoT devices are typically neither speculative nor OoO. Desktop and server processors are both speculative and OoO. In the middle of the computing spectrum (mobile phones and microcontrollers), you can find processors that are speculative but in-order, such as the Intel Bonnell microarchitecture. (The ARM Cortex-A9, by contrast, is both OoO and speculative.) Speculative execution is particularly beneficial when used with OoO.
The confusion came when I read the Meltdown and Spectre papers and did additional research. The Meltdown paper states that Meltdown is based on out-of-order execution, while some other resources, including the Wikipedia page about speculative execution, state that Meltdown is based on speculative execution.
The Meltdown vulnerability as described in the paper requires both speculative and out-of-order execution. However, this is a somewhat vague statement since there are many different speculative and out-of-order execution implementations. Meltdown doesn't work with just any type of OoO or speculative execution. For example, ARM11 (used in Raspberry Pis) supports some limited OoO and speculative execution, but it's not vulnerable.
See Peter's answer for more details on Meltdown and his other answer.
Related: What is the difference between Superscalar and OoO execution?.
I'm still having a hard time figuring out how Meltdown uses speculative execution. The example in the paper (the same one I mentioned here earlier) uses IMO only OoO - @Name in a comment
Meltdown is based on Intel CPUs optimistically speculating that loads won't fault, and that if a faulting load reaches the load ports, that it was the result of an earlier mispredicted branch. So the load uop gets marked so it will fault if it reaches retirement, but execution continues speculatively using data the page table entry says you aren't allowed to read from user-space.
Instead of triggering a costly exception-recovery when the load executes, it waits until it definitely reaches retirement, because that's a cheap way for the machinery to handle the branch miss -> bad load case. In hardware, it's easier for the pipe to keep piping unless you need it to stop / stall for correctness. e.g. A load where there's no page-table entry at all, and thus a TLB miss, has to wait. But waiting even on a TLB hit (for an entry with permissions that block using it) would be added complexity. Normally a page-fault is only ever raised after a failed page walk (which doesn't find an entry for the virtual address), or at retirement of a load or store that failed the permissions of the TLB entry it hit.
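As a rough illustration of "mark the load and fault only at retirement", here is a toy software model. Everything in it is an assumption made for the sketch: the `run` function, the tuple encoding of uops, and the `touched_lines` set standing in for cache state are hypothetical, and real hardware is of course not structured like this.

```python
def run(uops, memory, user_readable):
    """Toy model: execute every uop speculatively, then retire in program order."""
    spec_regs = {}         # speculative (not yet architectural) register values
    touched_lines = set()  # microarchitectural side effect: cache lines brought in
    fault_index = None     # oldest uop marked "fault if this reaches retirement"

    # Execution: nothing stops at a permission failure, the uop is only marked.
    for i, (op, dst, src) in enumerate(uops):
        if op == "load":
            spec_regs[dst] = memory[src]   # data is forwarded to dependent uops
            if src not in user_readable and fault_index is None:
                fault_index = i
        elif op == "touch":
            touched_lines.add(spec_regs[src])  # access indexed by the loaded value

    # Retirement: in order; the marked load faults and nothing at or after it commits.
    arch_regs = {}
    for i, (op, dst, src) in enumerate(uops):
        if i == fault_index:
            return arch_regs, touched_lines, True
        if op == "load":
            arch_regs[dst] = spec_regs[dst]
    return arch_regs, touched_lines, False


memory = {"kernel_addr": 42}   # 42 stands in for a secret byte
uops = [("load", "r1", "kernel_addr"), ("touch", None, "r1")]
arch_regs, touched_lines, faulted = run(uops, memory, user_readable=set())
print(arch_regs, touched_lines, faulted)   # {} {42} True
```

The architectural registers never see the secret and the fault is delivered, yet the set of touched lines still depends on the secret value. That leftover microarchitectural state is what a cache-timing side channel reads out.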
In a modern OoO pipelined CPU, all instructions are treated as speculative until retirement. Only at retirement do instructions become non-speculative. The Out-of-Order machinery doesn't really know or care whether it's speculating down one side of a branch that was predicted but not executed yet, or speculating past potentially-faulting loads. "Speculating" that loads don't fault or ALU instructions don't raise exceptions happens even in CPUs that aren't really considered speculative, but fully out-of-order execution turns that into just another kind of speculation.
I'm not too worried about an exact definition for "speculative execution", and what counts / what doesn't. I'm more interested in how modern out-of-order designs actually work, and that it's actually simpler to not even try to distinguish speculative from non-speculative until the end of the pipeline. This answer isn't even trying to address simpler in-order pipelines with speculative instruction-fetch (based on branch prediction) but not execution, or anywhere in between that and full-blown Tomasulo's algorithm with a ROB + scheduler with OoO exec + in-order retirement for precise exceptions.
For example, only after retirement can a store ever commit from the store buffer to L1d cache, not before. And to absorb short bursts and cache misses, it doesn't have to happen as part of retirement either. So one of the only non-speculative out-of-order things is committing stores to L1d; they have definitely happened as far as the architectural state is concerned, so they have to be completed even if an interrupt / exception happens.
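A minimal sketch of that store-buffer rule, under the assumption of a single FIFO buffer; the `StoreBuffer` class and its method names are hypothetical and only meant to show the ordering constraint (execute, then retire, then commit to L1d some time later).

```python
from collections import deque


class StoreBuffer:
    """Toy store buffer: entries commit to L1d only after their store has retired."""

    def __init__(self):
        self.entries = deque()   # kept in program order
        self.l1d = {}            # stands in for the L1d cache contents

    def execute_store(self, addr, value):
        # Executing a store only writes address + data into the buffer.
        self.entries.append({"addr": addr, "value": value, "retired": False})

    def retire_oldest_unretired(self):
        # Retirement marks the oldest not-yet-retired store as non-speculative.
        for entry in self.entries:
            if not entry["retired"]:
                entry["retired"] = True
                return

    def squash_unretired(self):
        # On a mispredict or exception, speculative stores are simply discarded.
        self.entries = deque(e for e in self.entries if e["retired"])

    def drain_one(self):
        # Commit may lag behind retirement, but can never run ahead of it.
        if self.entries and self.entries[0]["retired"]:
            entry = self.entries.popleft()
            self.l1d[entry["addr"]] = entry["value"]
            return True
        return False


sb = StoreBuffer()
sb.execute_store(0x10, 1)
sb.execute_store(0x20, 2)
committed_early = sb.drain_one()   # False: nothing has retired yet
sb.retire_oldest_unretired()       # first store becomes non-speculative
sb.squash_unretired()              # second store was on a wrong path
committed_late = sb.drain_one()    # True: the retired store reaches L1d
print(committed_early, committed_late, sb.l1d)   # False True {16: 1}
```

Once a store is marked retired it survives the squash and must eventually drain, which mirrors the point that retired stores have definitely happened as far as architectural state is concerned.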
The fault-if-reaching-retirement mechanism is a good way to avoid expensive work in the shadow of a branch mispredict. It also gives the CPU the right architectural state (register values, etc.) if the exception does fire. You do need that whether or not you let the OoO machinery keep churning on instructions beyond a point where you've detected an exception.
Branch-misses are special: there are buffers that record micro-architectural state (like register-allocation) on branches, so branch-recovery can roll back to that instead of flushing the pipeline and restarting from the last known-good retirement state. Branches do mispredict a fair amount in real code. Other exceptions are very rare.
Modern high-performance CPUs can keep (out-of-order) executing uops from before a branch miss, while discarding uops and execution results from after that point. Fast recovery is a lot cheaper than discarding and restarting everything from a retirement state that's potentially far behind the point where the mispredict was discovered.
E.g. in a loop, the instructions that handle the loop counter might get far ahead of the rest of the loop body, and detect the mispredict at the end soon enough to redirect the front-end and maybe not lose much real throughput, especially if the bottleneck was the latency of a dependency chain or something other than uop throughput.
This optimized recovery mechanism is only used for branches (because the state-snapshot buffers are limited), which is why branch misses are relatively cheap compared to full pipeline flushes. (e.g. on Intel, memory-ordering machine clears, performance counter machine_clears.memory_ordering: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?)
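The snapshot-based branch recovery described above can be pictured with a toy register-rename table. This is only a sketch under simplifying assumptions: the `RenameTable` class is hypothetical, physical registers are never reclaimed, and the number of checkpoints is unbounded, whereas the real buffers are limited.

```python
class RenameTable:
    """Toy rename table with per-branch checkpoints (hypothetical model)."""

    def __init__(self, arch_regs):
        # Initially architectural register i maps to physical register p<i>.
        self.map = {reg: f"p{i}" for i, reg in enumerate(arch_regs)}
        self.next_phys = len(arch_regs)
        self.checkpoints = {}

    def rename_dest(self, reg):
        # Every new write to a register gets a fresh physical register.
        self.map[reg] = f"p{self.next_phys}"
        self.next_phys += 1
        return self.map[reg]

    def checkpoint(self, branch_id):
        # Snapshot the mapping when a predicted branch is renamed.
        self.checkpoints[branch_id] = dict(self.map)

    def recover(self, branch_id):
        # On a mispredict, restore the snapshot; older uops are unaffected.
        self.map = self.checkpoints.pop(branch_id)


rt = RenameTable(["rax", "rcx"])
rt.rename_dest("rax")      # before the branch: rax -> p2
rt.checkpoint("b1")        # predicted branch renamed here
rt.rename_dest("rcx")      # wrong-path write: rcx -> p3
rt.recover("b1")           # mispredict detected: roll back to the snapshot
print(rt.map)              # {'rax': 'p2', 'rcx': 'p1'}
```

The write to rax from before the branch survives recovery, while the wrong-path rename of rcx is undone, without going back to the last retired state.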
Exceptions are not unheard-of, though; page-faults do happen in the normal course of operation. e.g. store to a read-only page triggers copy-on-write. Load or store to an unmapped page triggers page-in or handling the lazy mapping. But thousands to millions of instructions usually run between every page fault even in a process that's allocating new memory frequently (roughly one fault per microsecond to one per millisecond on a 1GHz CPU running about one instruction per cycle). In code that doesn't map new memory, you can go far longer without exceptions. Mostly just a timer interrupt occasionally in pure number crunching without I/O.
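The back-of-the-envelope numbers above can be checked with a few lines of arithmetic. The one-instruction-per-cycle figure is an assumption for the sake of the estimate, not a measured value, and `time_between_faults` is just a helper name for this sketch.

```python
CLOCK_HZ = 1_000_000_000   # 1 GHz, as in the text
IPC = 1.0                  # assumed average instructions per cycle


def time_between_faults(instructions):
    """Seconds needed to run `instructions` at the assumed clock and IPC."""
    return instructions / (IPC * CLOCK_HZ)


per_thousand = time_between_faults(1_000)       # about one microsecond
per_million = time_between_faults(1_000_000)    # about one millisecond
print(per_thousand, per_million)
```

A higher IPC shrinks both times proportionally, but page faults stay rare compared with branch mispredicts either way.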
But anyway, you don't want to trigger a pipeline flush or anything expensive until you're sure that an exception will really fire. And that you're sure you have the right exception. e.g. maybe the load address for an earlier faulting load wasn't ready as soon, so the first faulting load to execute wasn't the first in program order. Waiting until retirement is a cheap way to get precise exceptions. Cheap in terms of additional transistors to handle this case, and letting the usual in-order retirement machinery figure out exactly which exception fires is fast.
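A small sketch of why waiting for retirement gives the right exception for free. The tuple layout and the `pick_exception` helper are hypothetical; the point is only that the exception detected first in time is not necessarily the one that is first in program order.

```python
def pick_exception(detected):
    """detected: list of (cycle_detected, program_index, description) tuples."""
    # What an eager scheme would react to: the fault seen first in time.
    first_seen = min(detected, key=lambda d: d[0])
    # What in-order retirement reports: the fault that is oldest in program order.
    architectural = min(detected, key=lambda d: d[1])
    return first_seen, architectural


detected = [
    (3, 7, "page fault on load B"),   # younger load, address was ready early
    (9, 4, "page fault on load A"),   # older load, address was ready late
]
first_seen, architectural = pick_exception(detected)
print(first_seen[2])      # page fault on load B
print(architectural[2])   # page fault on load A
```

Load B faults first in time, but load A is older in program order, so A's fault is the one that must be delivered. In-order retirement reaches A first and gets this right without any extra logic.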
The useless work done executing instructions after an instruction marked to fault on retirement costs a tiny bit of power, and isn't worth blocking because exceptions are so rare.
This explains why it makes sense to design hardware that was vulnerable to Meltdown in the first place. Obviously it's not safe to keep doing this, now that Meltdown has been thought of.
We don't need to block speculative execution after a faulting load; we just need to make sure it doesn't actually use sensitive data. It's not the load succeeding speculatively that's the problem; Meltdown is based on the following instructions using that data to produce data-dependent microarchitectural effects (e.g. touching a cache line based on the data).
So if the load ports mask the loaded data to zero or something as well as setting the fault-on-retirement flag, execution continues but can't gain any info about the secret data. This should take about 1 extra gate delay of critical path, which is probably possible in the load ports without limiting the clock speed or adding an extra cycle of latency. (1 clock cycle is long enough for logic to propagate through many AND/OR gates within a pipeline stage, e.g. a full 64-bit adder).
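Here is a toy comparison of a load port that forwards the real data with one that masks it to zero when the permission check fails. The function names and the `mask_on_fault` flag are made up for this sketch; it models the idea only, not any actual hardware fix.

```python
def speculative_load(memory, addr, user_readable, mask_on_fault):
    """Return (value forwarded to dependent uops, fault-on-retirement flag)."""
    faults = addr not in user_readable
    value = memory[addr]
    if faults and mask_on_fault:
        value = 0   # the one extra gate: AND the data with "permission OK"
    return value, faults


def line_touched_by_gadget(memory, addr, user_readable, mask_on_fault):
    """Cache line index a dependent access would touch after the load."""
    value, _faults = speculative_load(memory, addr, user_readable, mask_on_fault)
    return value


memory = {"kernel_addr": 42}   # 42 stands in for a secret byte
leaky = line_touched_by_gadget(memory, "kernel_addr", set(), mask_on_fault=False)
fixed = line_touched_by_gadget(memory, "kernel_addr", set(), mask_on_fault=True)
print(leaky, fixed)   # 42 0
```

With masking, the dependent access always touches line 0 regardless of the secret, so the side channel carries no information, while the fault is still flagged for retirement exactly as before.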
Related: I suggested the same mechanism for a HW fix for Meltdown in Why are AMD processors not/less vulnerable to Meltdown and Spectre?.