The inner workings of Spectre (v2)

I have done some reading about Spectre v2 and obviously you get the non technical explanations. Peter Cordes has a more in-depth explanation but it doesn't fully address a few details. Note: I have never performed a Spectre v2 attack so I do not have hands on experience. I have only read up about about the theory.

My understanding of Spectre v2 is that you make an indirect branch mispredict for instance if (input < data.size). If the Indirect Target Array (which I'm not too sure of the details of -- i.e. why it is separate from the BTB structure) -- which is rechecked at decode for RIPs of indirect branches -- does not contain a prediction then it will insert the new jump RIP (branch execution will eventually insert the target RIP of the branch), but for now it does not know the target RIP of the jump so any form of static prediction will not work. My understanding is it is always going to predict not taken for new indirect branches and when Port 6 eventually works out the jump target RIP and prediction it will roll back using the BOB and update the ITA with the correct jump address and then update the local and global branch history registers and the saturating counters accordingly.

The hacker needs to train the saturating counters to always predict taken which, I imagine, they do by running the if(input < data.size) multiple times in a loop where input is set to something that is indeed less than data.size (catching errors accordingly) and on the final iteration of the loop, make input more than data.size (1000 for instance); the indirect branch will be predicted taken and it will jump to the body of the if statement where the cache load takes place.

The if statement contains secret = data[1000] (A particular memory address (data[1000]) that contains secret data is targeted for loading from memory to cache) then this will be allocated to the load buffer speculatively. The preceding indirect branch is still in the branch execution unit and waiting to complete.

I believe the premise is that the load needs to be executed (assigned a line fill buffer) before the load buffers are flushed on the misprediction. If it has been assigned a line fill buffer already then nothing can be done. It makes sense that there isn't a mechanism to cancel a line fill buffer allocation because the line fill buffer would have to pend before storing to the cache after returning it to the load buffer. This could cause line fill buffers to become saturated because instead of deallocating when required (keeping it in there for speed of other loads to the same address but deallocating when the there are no other available line buffers). It would not be able to deallocate until it receives some signal that a flush is not going to occur, meaning it has to halt for the previous branch to execute instead of immediately making the line fill buffer available for the stores of the other logical core. This signalling mechanism could be difficult to implement and perhaps it didn't cross their minds (pre-Spectre thinking) and it would also introduce delay in the event that branch execution takes enough time for hanging line fill buffers to cause a performance impact i.e. if data.size is purposefully flushed from the cache (CLFLUSH) before the final iteration of the loop meaning branch execution could take up to 100 cycles.

I hope my thinking is correct but I'm not 100% sure. If anyone has anything to add or correct then please do.

Sometimes the term "BTB" is used collectively to refer to all of the buffers used by the branch prediction unit. However, there are actually multiple buffers all of which are used in every cycle to make target and direction predictions. In particular, the BTB is used to make predictions for direct branches, the ITB (indirect target buffer) is used to make predictions for indirect branches except for returns, and the RSB is used to make predictions for returns. The ITB is also called the IBTB or Indirect Target Array. All of these terms are used by different vendors and researchers. Typically, the BTB is used to make initial predictions for all kinds of branch instructions when the other buffers miss. But later the predictor learns more about the branches and the other buffers come into play. If multiple dynamic instances of the same indirect branch have all the same target, then the BTB might also be used instead of the ITB. The ITB is much more accurate when the same branch has multiple targets and is designed specifically to deal with such branches. See: Branch prediction and the performance of interpreters — Don't trust folklore. The first Intel processor that implemented separate BTB and ITB structures is the Pentium M. All later Intel Core processors have dedicated ITBs.

The Spectre V1 exploit is based on training the BTB using an attacker program so that when the victim executes a branch that aliases the same BTB entry, the processor is tricked into speculatively executing instructions (called the gadget) to leak information. The Spectre V2 exploit is similar but is based on training the ITB instead. The crucial difference here is that in V1, the processor mispredicts the direction of the branch, while in V2, the processor mispredicts the target of the branch (and, in case of a conditional indirect branch, the direction as well because we want it to be taken). In programs that are interpreted, JIT-compiled, or make use of dynamic polymorphism, there can be many indirect branches (other than returns). A particular indirect branch may never be intended to go to some location, but by mistraining the predictor, it can be made to jump anywhere we want. It is exactly for this reason why V2 is very powerful; no matter where the gadget is and no matter what the intentional control flows of the program are, you can pick one of the indirect branches and make it speculatively jump to the gadget.

Note that typically the linear address of the target of a static direct branch remains the same throughout the lifetime of the program. There is only one situation where this may not be the case: dynamic code modification. So at least in theory, a Spectre exploit can be developed based on target misprediction of direct branches.

Regarding reclamation of LFBs, I don't really understand what you're saying. When a load request that missed the L1D receives the data into the LFB, the data is immediately forwarded to the bypass interconnect of the pipeline. There needs to be a way to determine which load uop has requested this data. The data returned must be tagged with the uop ID of the load. The sources of the uops in the RS that are waiting for the data are represented as the uop IDs of the loads. In addition, the ROB entry that holds the load uop needs to be marked as completed so that it can be retired and, in pre-SnB, the returned data needs to be written into the ROB. If, on pipeline flush, an outstanding load request in an LFB is not cancelled, and if the load uop ID got reused for some other uop, when the data arrives, it might be incorrectly forwarded to whatever new uops are currently in the pipeline, thereby corrupting the microarchitectural state. So there needs to be a way to ensure that this does not happen under any circumstances. It's very possible to cancel outstanding load requests and speculative RFOs on a pipeline flush by simple marking all of the valid LFB entries as "cancelled", just so that the data is not returned to the pipeline. However, the data might still be fetched and filled in into one or more levels of caches. Requests in the LFB are identified by line-aligned physical addresses. There can be other possible designs.

I've decided to run an experiment to determine exactly when the LFBs get deallocated on Haswell. Here is how it works:

Outer Loop (10K iterations):

Inner Loop (100 iterations):
10 load instructions to different cache lines most of which miss the L2.
LFENCE.
A sequence of IMULs to delay the resolution of the jump by 18 cycles.
Jump to inner.

3 load instructions to different cache lines.
LFENCE.
Jump to outer.

For this to work, hyperthreading and both L1 prefetchers need to be turned off to ensure that we own all of the 10 LFBs of the L1.

The LFENCE instructions ensure that we don't run out of LFBs when executing on a correctly predicted path. The key idea here is that the inner jump will be mispredicted once per outer iteration, so up to 10 loads of the inner iteration that are on the mispredicted path can be allocated in the LFBs. Note that the LFENCE prevents loads from later iterations to be allocated. After a few cycles, the inner branch will be resolved and a misprediction occurs. The pipeline is cleared and the frontend is resteered to fetch and execute the load instructions in the outer loop.

There are two possible outcomes:

The LFBs that have been allocated for the loads on the mispredicted path are immediately released as part of the pipeline clear operation and made available for other loads. In this case, there will be no stalls due to LFB unavailability (counted using L1D_PEND_MISS.FB_FULL).
The LFBs are released only when the loads get serviced irrespective of whether they were on a mispredicted path.

When there are three loads in the outer loop after the inner jump, the measured value of L1D_PEND_MISS.FB_FULL is about equal to the number of outer iterations. That's one request per outer loop iteration. This means that when the three loads on the correct path get issued to the L1D, the loads from the mispredcited path are still occupying the 8 LFB entries, resulting in an FB full event for the third load. This suggests that loads in the LFBs only get deallcoated when the load actually completes.

If I put less than two loads in the outer loop, there will be basically no FB full events. There is one thing I noticed: for every additional load in the outer loop beyond three loads, the L1D_PEND_MISS.FB_FULL gets increased by about 20K instead of the expected 10K. I think what's happening is that when a load request of a load uop gets issued to the L1D for the first time and the all LFBs are in use, it gets rejected. Then when an LFB becomes available, two loads pending in the load buffer are sent to the L1D, one will be allocated in the LFB and the other will get rejected. So we get two LFB full events per additional load. However, when there are three loads in the outer loop, only the third one would be waiting for an LFB, so we get one event per outer loop iteration. Essentially, the load buffer cannot distinguish between having one LFB available or two LFBs; it only gets to know that at least one LFB is free and so it attempts to send two load requests at the same time since there are two load ports.

For branches, some are like jc .somewhere where the CPU only really needs to guess if the branch will be taken or not taken to be able to speculate down the guessed path. However, some branches are like jmp [table+eax*8] where there can be over 4 billion possible directions, and for those cases the CPU needs to guess the target address to be able to speculate down the guessed path. Because there's very different types of branches, the CPU uses very different types of predictors.

For Spectre, there's a "meta pattern" - the attacker uses speculative execution to trick CPU into leaving information in something, then extracts that information from the something. There are multiple possibilities for "something" (data caches, instruction caches, TLBs, branch target buffer, branch direction buffer, return stack, write-combining buffers, ...) and therefore there's are many possible variations of spectre (and not just the "well known first two variations" that were made public in early 2018).

For spectre v1 (where "something" is a data cache) the attacker needs some way to trick the CPU into putting data into the data cache (e.g. a load and then a second load that depends on the value from the first load, which can be executed speculatively) and some way to extract the information (flush everything in the cache, then use the amount of time that a load takes to determine how the state of the data cache changed).

For spectre v2 (where "something" is the branch direction buffer that's used for instructions like jc .somewhere) the attacker needs some way to trick the CPU into putting data into the branch direction buffer (e.g. a load and then a branch that depends on the load, which can be executed speculatively) and some way to extract the information (set the branch direction buffer to a known state beforehand, then use the amount of time that a branch takes to determine how the state of the branch direction buffer changed).

For all of the many possible variations of spectre, the only important thing (for defense) is what the "something" can be (and how to prevent information from getting into the "something", or flush/overwrite/destroy information that got into the "something"). Everything else (specific details of one of the many possible implementations of code to attack any one of the many possible spectre variations) is unimportant.

Vague History Of Spectre

The original Spectre (v1, using cache timing) was found in 2017 and publicly announced in January 2018. It was like a dam bursting, and a few other variants (e.g. v2, using branch prediction) quickly followed. These early variations grabbed a lot of publicity. In the ~6 months or so after that multiple other variants were found, but didn't get as much publicity and a lot of people weren't (and still aren't) aware of them. By the "latter half" of 2018 people (e.g. me) started losing track of which variants were proven (via. "proof of concept" implementations) and which were still unproven, and some researchers started trying to enumerate the possibilities and establish naming conventions for them. The best example of this that I've seen so far is "A Systematic Evaluation of Transient Execution Attacks and Defenses" (see https://arxiv.org/pdf/1811.05441.pdf ).

However, the "hole in the dam wall" isn't something that can be plugged easily, and (for random guesses) I think it's going to take several years before we can assume all possibilities have been explored (and I think the need for mitigation will never go away).

Thanks Brendan and Hadi Brais, after reading your answers and finally reading the spectre paper it is clear now where I was going wrong in my thinking and I confused the two a little.

I was partially describing Spectre v1 which causes a bounds check bypass by mistraining the branch history of a jump i.e. if (x < array1_size) to a spectre gadget. This is obviously not an indirect branch. The hacker does this by invoking a function containing the spectre gadget with legal parameters to prime the branch predictor (PHT+BHT) and then invoke with illegal parameters to bring array1[x] into cache. They then reprime the branch history by supplying legal parameters and then flush array1_size from cache (which I'm not sure how they do because even if the attacker process knows the VA of array1_size, the line cannot be flushed because the TLB contains a different PCID for the process, so it must be caused to be evicted in some way i.e. filling the set at that virtual address). They then invoke with the same illegal parameters as before and as array1[x] is in cache but array1_size is not, array[x] will resolve quickly and begin the load of array2[array1[x]] while still waiting on array1_size, which loads a position in array2 based on the secret at any x that transcends the bounds of array1. The attacker then recalls the function with a valid value of x and times the function call (I assume the attacker must know the contents of array1 because if array2[array1[8]] results in a faster access they need to know what is at array1[8] as that is the secret, but surely that array would have to contain every 2^8 bit combination right).

Spectre v2 on the other hand requires a second attack process that knows the virtual address of an indirect branch in the victim process so that it can poison the target and replace it with another address. If the attack process contains a jump instruction that would reside in the same set, way and tag in the IBTB as the victim indirect branch would then it just trains that branch instruction to predict taken and jump to a virtual address which happens to be that of the gadget in the victim process. When the victim process encounters the indirect branch the wrong target address from the attack program is in the IBTB. It is crucial that it is an indirect branch because falsities as a result of a process switch are usually checked at decode i.e. if the branch target differs from the target in the BTB for that RIP then it flushes the instructions fetched before it. This cannot be done with indirect branches because it does not know the target until the execution stage and hence the idea is that the indirect branch selected depends on a value that needs to be fetched from cache. It then jumps to this target address which is that of the gadget and so on and so forth.

The attacker needs to know the source code of the victim process to identify a gadget and they need to know the VA at which it will reside. I assume this could be done by knowing predictably where code will be loaded. I believe .exes are typically loaded at x00400000 for instance and then there is a BaseOfCode in the PE header.

Edit: I just read Appendix B of the spectre paper and it makes for a nice Windows implementation of Spectre v2.

As a proof-of-concept, we constructed a simple target application which provides the service of computing a SHA1 hash of a key and an input message. This implementation consisted of a program which continuously runs a loop which calls Sleep(0), loads the input from a file, invokes the Windows cryptography functions to compute the hash, and prints the hash whenever the input changes. We found that the Sleep() call is done with data from the input file in registers ebx, edi, and an attacker-known value for edx, i.e., the content of two registers is controlled by the attacker. This is the input criteria for the type of Spectre gadget described in the beginning of this section.

It uses ntdll.dll (.dll full of native API system call stubs) and kernel32.dll (Windows API) which are always mapped in the user virtual address space on the direction of ASLR (specified in the .dll images), except the physical address is likely to be the same due to copy-on-write view mapping into the page cache. The indirect branch to poison will be in the Windows API Sleep() function in kernel32.dll which appears to indirectly call NtDelayExecution() in ntdll.dll. The attacker then ascertains the address of the indirect branch instruction and maps a page encompassing the victim address that contains the target address into its own address space and changes the target address stored at that address to that of the gadget that they identified to be residing somewhere in the same or another function in ntdll.dll (I'm not entirely sure (due to ASLR) how the attacker knows for certain where the victim process maps kernel32.dll and ntdll.dll in its address space in order to locate the address of the indirect branch in Sleep() for the victim. Appendix B claims they used 'Simple pointer operations' to locate the indirect branch and address that contains the target -- how that works I'm not sure). Threads are then launched with the same affinity of the victim (so that the victim and mistraining threads hyperthread on the same physical core) which call Sleep() themselves to train it indirectly which in the address space context of the hack process will now jump to the address of the gadget. The gadget is temporarily replaced with a ret so that it returns from Sleep() smoothly. These threads will also execute a sequence before the indirect jump to mimic what the global branch history of the victim would be before encountering the indirect jump to fully ensure that the branch is taken in an alloyed history. A separate thread is then launched with the complement of the thread affinity of the victim that repeatedly evicts the victim's memory address containing the jump destination to ensure that when the victim does encounter the indirect branch it will take a long RAM access to resolve which allows the gadget to speculate ahead before the branch destination can be checked against the BTB entry and the pipeline flushed. In JavaScript, eviction is done by loading to the same cache set i.e. in multiples of 4096. The mistraining threads, eviction threads and victim threads are all running and looping at this stage. When the victim process loop calls Sleep(), the indirect branch speculates to the gadget due to the IBTB entry that the hacker poisoned previously. A probing thread is launched with the complement of the victim process thread affinity (so as not to interfere with the mistraining and victim branch history). The probing thread will modify the header of the file that the victim process uses which results in those values residing in ebx and edi when Sleep() is called meaning the probing thread can directly influence the values stored in ebx and edi. The spectre gadget branched to in the example adds the value stored at [ebx+edx+13BE13BDh] to edi and then loads a value at the address stored in edi and adds it with a carry to dl. This allows the probing thread to learn the value stored at [ebx+edx+13BE13BDh] as if it selects an original edi of 0 then the value accessed in the second operation will be loaded from the virtual address range 0x0 – 0x255 by which time the indirect branch will resolve but the side effects are already present. The attack process needs to ensure that it has mapped the same physical address into the same location in its virtual address space in order to probe the probing array with a timing attack. Not sure how it does this but in windows it would, AFAIK, need to map a view of a page-file–backed section object that has been opened by the victim at that location. Either that or it would manipulate the victim to call the spectre gadget with a negative TC ebx value such that ebx+edx+13BE13BDh = 0, =1,..., =255 and somehow time that call. This could potentially also be achieved by using APC injection.

来源：https://stackoverflow.com/questions/54541157/the-inner-workings-of-spectre-v2

标签

x86

intel

cpu-architecture

branch-prediction

spectre