问题
How to count the number of CPU clock cycles between the start and end of a benchmark in gem5?
I'm interested in all of the following cases:
full system userland benchmark. Maybe the
m5
guest tool has a way to do it?bare metal benchmark. When gem5 exits it dumps the stats automatically, so the main question is how to skip the cycles for bootloader and go straight to the benchmark itself.
Is there a way besides modifying the benchmark source with instrumentation instructions? How to write those instrumentation instructions in detail?
syscall emulation benchmark. I think gem5 just outputs the
stats.txt
at the end of the run, and then you ca just grepsystem.cpu.numCycles
, but I have to confirm it, currently blocked on: How to solve "FATAL: kernel too old" when running gem5 in syscall emulation SE mode?
I want to use this to learn:
- learn how CPUs work
- how to optimize assembly code or compiler settings to run optimally on a given CPU
回答1:
m5
tool
A good approximation is to run, ideally from a shell script that is the /init
program:
m5 resetstats
run-benchmark
m5 dumpstats
Then on host:
grep -E '^system.cpu.numCycles ' m5out/stats.txt
Gives something like:
system.cpu.numCycles 33942872680 # number of cpu cycles simulated
Note that if you replay from a m5 checkpoint
with a different CPU, e.g.:
--restore-with-cpu=HPI --caches
then you need to grep for a different identifier:
grep -E '^system.switch_cpus.numCycles ' m5out/stats.txt
resetstats
zeroes out the cumulative stats, and dumpstats
dumps what has been collected during the benchmark.
This is not perfect since there is some time between the exec syscall for m5 dumpstats
finishing and the benchmark starting, but if the benchmark enough, this shouldn't matter.
http://arm.ecs.soton.ac.uk/wp-content/uploads/2016/10/gem5_tutorial.pdf also proposes a few more heuristics:
#!/bin/sh
# Wait for system to calm down
sleep 10
# Take a checkpoint in 100000 ns
m5 checkpoint 100000
# Reset the stats
m5 resetstats
run-benchmark
# Exit the simulation
m5 exit
m5 exit
also works since GEM5 dumps stats when it finishes.
Instrumentation instructions
Sometimes those seem to be just inevitable that you have to modify the input source code a bit with those instructions in order to:
- skip initialization and go directly to steady state
- evaluate individual main loop runs
You can of course deduce those instructions from the gem5 m5
tool code code, but here are some very easy to re-use one line copy pastes for arm and aarch64, e.g. for aarch64:
/* resetstats */
__asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0XFF000110 | (0x40 << 16);" : : : "x0", "x1")
/* dumpstats */
__asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x41 << 16);" : : : "x0", "x1")
The m5
tool uses the same mechanism under the hood, but by adding the instructions directly into the source, we avoid the syscall, and therefore more precise and representative (at the cost of more manual work).
To ensure that the assembly is not reordered around your ROI by the compiler however, you might want to use the techniques mentioned at: Enforcing statement order in C++
Address monitoring
Another technique that can be used is to monitory addresses of interest instead of adding magic instructions to the source.
E.g., if you know that a benchmark starts with PIC == 0x400
, it should be possible to do something when that addresses is hit.
To find the addresses of interest, you would have for example to use readelf
or gdb
or tracing, and the if running full system on top of Linux, ensure that ASLR is turned off.
This technique would be the least intrusive one, but the setup is harder, and to be honest I haven't done it yet. One day, one day.
来源:https://stackoverflow.com/questions/48944587/how-to-count-the-number-of-cpu-clock-cycles-between-the-start-and-end-of-a-bench