问题
I'm developing (NASM + GCC targetting ELF64) a PoC that uses a spectre gadget that measures the time to access a set of cache lines (FLUSH+RELOAD).
How can I make a reliable spectre gadget?
I believe I understand the theory behind the FLUSH+RELOAD technique, however in practice, despiste some noise, I'm unable to produce a working PoC.
Since I'm using the Timestamp counter and the loads are very regular I use this script to disable the prefetchers, the turbo boost and to fix/stabilize the CPU frequency:
#!/bin/bash
sudo modprobe msr
#Disable turbo
sudo wrmsr -a 0x1a0 0x4000850089
#Disable prefetchers
sudo wrmsr -a 0x1a4 0xf
#Set performance governor
sudo cpupower frequency-set -g performance
#Minimum freq
sudo cpupower frequency-set -d 2.2GHz
#Maximum freq
sudo cpupower frequency-set -u 2.2GHz
I have a continuous buffer, aligned on 4KiB, large enough to span 256 cache lines separated by an integral number GAP of lines.
SECTION .bss ALIGN=4096
buffer: resb 256 * (1 + GAP) * 64
I use this function to flush the 256 lines.
flush_all:
lea rdi, [buffer] ;Start pointer
mov esi, 256 ;How many lines to flush
.flush_loop:
lfence ;Prevent the previous clflush to be reordered after the load
mov eax, [rdi] ;Touch the page
lfence ;Prevent the current clflush to be reordered before the load
clflush [rdi] ;Flush a line
add rdi, (1 + GAP)*64 ;Move to the next line
dec esi
jnz .flush_loop ;Repeat
lfence ;clflush are ordered with respect of fences ..
;.. and lfence is ordered (locally) with respect of all instructions
ret
The function loops through all the lines, touching every page in between (each page more than once) and flushing each line.
Then I use this function to profile the accesses.
profile:
lea rdi, [buffer] ;Pointer to the buffer
mov esi, 256 ;How many lines to test
lea r8, [timings_data] ;Pointer to timings results
mfence ;I'm pretty sure this is useless, but I included it to rule out ..
;.. silly, hard to debug, scenarios
.profile:
mfence
rdtscp
lfence ;Read the TSC in-order (ignoring stores global visibility)
mov ebp, eax ;Read the low DWORD only (this is a short delay)
;PERFORM THE LOADING
mov eax, DWORD [rdi]
rdtscp
lfence ;Again, read the TSC in-order
sub eax, ebp ;Compute the delta
mov DWORD [r8], eax ;Save it
;Advance the loop
add r8, 4 ;Move the results pointer
add rdi, (1 + GAP)*64 ;Move to the next line
dec esi ;Advance the loop
jnz .profile
ret
An MCVE is given in appendix and a repository is available to clone.
When assembled with GAP
set to 0, linked and executed with taskset -c 0
the cycles necessary to fetch each line are shown below.
Only 64 lines are loaded from memory.
The output is stable across different runs.
If I set GAP
to 1 only 32 lines are fetched from memory, ofcourse 64 * (1+0) * 64 = 32 * (1+1) * 64 = 4096, so this may be related to paging?
If a store is executed before the profiling (but after the flush) to one of the first 64 lines, the output changes to this
Any store the the other lines gives the first type of output.
I suspect the math in the is broken but I need another couple of eyes find out where.
EDIT
Hadi Brais pointed out a misuse of a volatile register, after fixing that the output is now inconsistent.
I see prevalently runs where the timings are low (~50 cycles) and sometimes runs where the timing are higher (~130 cycles).
I don't know where the 130 cycles figure come from (too low for memory, too high for the cache?).
Code is fixed in the MCVE (and the repository).
If a store to any of the first lines is executed before the profiling, no change is reflected in the output.
APPENDIX - MCVE
BITS 64
DEFAULT REL
GLOBAL main
EXTERN printf
EXTERN exit
;Space between lines in the buffer
%define GAP 0
SECTION .bss ALIGN=4096
buffer: resb 256 * (1 + GAP) * 64
SECTION .data
timings_data: TIMES 256 dd 0
strNewLine db `\n0x%02x: `, 0
strHalfLine db " ", 0
strTiming db `\e[48;5;16`,
.importance db "0",
db `m\e[38;5;15m%03u\e[0m `, 0
strEnd db `\n\n`, 0
SECTION .text
;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .'
; ' ' ' ' ' ' ' ' ' ' '
; _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \
;/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \
;
;
;FLUSH ALL THE LINES OF A BUFFER FROM THE CACHES
;
;
flush_all:
lea rdi, [buffer] ;Start pointer
mov esi, 256 ;How many lines to flush
.flush_loop:
lfence ;Prevent the previous clflush to be reordered after the load
mov eax, [rdi] ;Touch the page
lfence ;Prevent the current clflush to be reordered before the load
clflush [rdi] ;Flush a line
add rdi, (1 + GAP)*64 ;Move to the next line
dec esi
jnz .flush_loop ;Repeat
lfence ;clflush are ordered with respect of fences ..
;.. and lfence is ordered (locally) with respect of all instructions
ret
;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .'
; ' ' ' ' ' ' ' ' ' ' '
; _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \
;/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \
;
;
;PROFILE THE ACCESS TO EVERY LINE OF THE BUFFER
;
;
profile:
lea rdi, [buffer] ;Pointer to the buffer
mov esi, 256 ;How many lines to test
lea r8, [timings_data] ;Pointer to timings results
mfence ;I'm pretty sure this is useless, but I included it to rule out ..
;.. silly, hard to debug, scenarios
.profile:
mfence
rdtscp
lfence ;Read the TSC in-order (ignoring stores global visibility)
mov ebp, eax ;Read the low DWORD only (this is a short delay)
;PERFORM THE LOADING
mov eax, DWORD [rdi]
rdtscp
lfence ;Again, read the TSC in-order
sub eax, ebp ;Compute the delta
mov DWORD [r8], eax ;Save it
;Advance the loop
add r8, 4 ;Move the results pointer
add rdi, (1 + GAP)*64 ;Move to the next line
dec esi ;Advance the loop
jnz .profile
ret
;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .'
; ' ' ' ' ' ' ' ' ' ' '
; _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \
;/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \
;
;
;SHOW THE RESULTS
;
;
show_results:
lea rbx, [timings_data] ;Pointer to the timings
xor r12, r12 ;Counter (up to 256)
.print_line:
;Format the output
xor eax, eax
mov esi, r12d
lea rdi, [strNewLine] ;Setup for a call to printf
test r12d, 0fh
jz .print ;Test if counter is a multiple of 16
lea rdi, [strHalfLine] ;Setup for a call to printf
test r12d, 07h ;Test if counter is a multiple of 8
jz .print
.print_timing:
;Print
mov esi, DWORD [rbx] ;Timing value
;Compute the color
mov r10d, 60 ;Used to compute the color
mov eax, esi
xor edx, edx
div r10d ;eax = Timing value / 78
;Update the color
add al, '0'
mov edx, '5'
cmp eax, edx
cmova eax, edx
mov BYTE [strTiming.importance], al
xor eax, eax
lea rdi, [strTiming]
call printf WRT ..plt ;Print a 3-digits number
;Advance the loop
inc r12d ;Increment the counter
add rbx, 4 ;Move to the next timing
cmp r12d, 256
jb .print_line ;Advance the loop
xor eax, eax
lea rdi, [strEnd]
call printf WRT ..plt ;Print a new line
ret
.print:
call printf WRT ..plt ;Print a string
jmp .print_timing
;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .'
; ' ' ' ' ' ' ' ' ' ' '
; _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \
;/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \
;
;
;E N T R Y P O I N T
;
;
;'._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .''._ .'
; ' ' ' ' ' ' ' ' ' ' '
; _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \ _' \
;/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \/ \
main:
;Flush all the lines of the buffer
call flush_all
;Test the access times
call profile
;Show the results
call show_results
;Exit
xor edi, edi
call exit WRT ..plt
回答1:
The buffer is allocated from the bss
section and so when the program is loaded, the OS will map all of the buffer
cache lines to the same CoW physical page. After flushing all of the lines, only the accesses to the first 64 lines in the virtual address space miss in all cache levels1 because all2 later accesses are to the same 4K page. That's why the latencies of the first 64 accesses fall in the range of the main memory latency and the latencies of all later accesses are equal to the L1 hit latency3 when GAP
is zero.
When GAP
is 1, every other line of the same physical page is accessed and so the number of main memory accesses (L3 misses) is 32 (half of 64). That is, the first 32 latencies will be in the range of the main memory latency and all later latencies will be L1 hits. Similarly, when GAP
is 63, all accesses are to the same line. Therefore, only the first access will miss all caches.
The solution is to change mov eax, [rdi]
in flush_all
to mov dword [rdi], 0
to ensure that the buffer is allocated in unique physical pages. (The lfence
instructions in flush_all
can be removed because the Intel manual states that clflush
cannot be reordered with writes4.) This guarantees that, after initializing and flushing all lines, all accesses will miss all cache levels (but not the TLB, see: Does clflush also remove TLB entries?).
You can refer to Why are the user-mode L1 store miss events only counted when there is a store initialization loop? for another example where CoW pages can be deceiving.
I suggested in the previous version of this answer to remove the call to flush_all
and use a GAP
value of 63. With these changes, all of the access latencies appeared to be very high and I have incorrectly concluded that all of the accesses are missing all cache levels. Like I said above, with a GAP
value of 63, all of the accesses become to the same cache line, which is actually resident in the L1 cache. However, the reason that all of the latencies were high is because every access was to a different virtual page and the TLB didn't have any of mappings for each of these virtual pages (to the same physical page) because by removing the call to flush_all
, none of the virtual pages were touched before. So the measured latencies represent the TLB miss latency, even though the line being accessed is in the L1 cache.
I also incorrectly claimed in the previous version of this answer that there is an L3 prefetching logic that cannot be disabled through MSR 0x1A4. If a particular prefetcher is turned off by setting its flag in MSR 0x1A4, then it does fully get switched off. Also there are no data prefetchers other than the ones documented by Intel.
Footnotes:
(1) If you don't disable the DCU IP prefetcher, it will actually prefetch back all the lines into the L1 after flushing them, so all accesses will still hit in the L1.
(2) In rare cases, the execution of interrupt handlers or scheduling other threads on the same core may cause some of the lines to be evicted from the L1 and potentially other levels of the cache hierarchy.
(3) Remember that you need to subtract the overhead of the rdtscp
instructions. Note that the measurement method you used actually doesn't enable you to reliably distinguish between an L1 hit and an L2 hit. See: Memory latency measurement with time stamp counter.
(4) The Intel manual doesn't seem to specify whether clflush
is ordered with reads, but it appears to me that it is.
来源:https://stackoverflow.com/questions/50818128/how-can-i-create-a-spectre-gadget-in-practice