Question
I know from the gem5 source code and some publications that the ARM PMU is partially implemented.
I have a binary that uses perf_event to access the PMU on a Linux-based OS running on an ARM processor. Can it use perf_event inside a gem5 full-system simulation with a Linux kernel, under the ARM ISA?
So far, I haven't found the right way to do it. If someone knows, I would be very grateful!
Answer 1:
As of September 2020, gem5 needs to be patched in order to use the ARM PMU.
Edit: As of November 2020, gem5 is now patched and it will be included in the next release. Thanks to the developers!
How to patch gem5
This is not a clean patch (it is deliberately straightforward); it is meant mainly to show how the fix works. Nonetheless, this is the patch to apply with git apply from the gem5 source repository:
diff --git i/src/arch/arm/ArmISA.py w/src/arch/arm/ArmISA.py
index 2641ec3fb..3d85c1b75 100644
--- i/src/arch/arm/ArmISA.py
+++ w/src/arch/arm/ArmISA.py
@@ -36,6 +36,7 @@
 from m5.params import *
 from m5.proxy import *
+from m5.SimObject import SimObject
 from m5.objects.ArmPMU import ArmPMU
 from m5.objects.ArmSystem import SveVectorLength
 from m5.objects.BaseISA import BaseISA
@@ -49,6 +50,8 @@ class ArmISA(BaseISA):
     cxx_class = 'ArmISA::ISA'
     cxx_header = "arch/arm/isa.hh"
+    generateDeviceTree = SimObject.recurseDeviceTree
+
     system = Param.System(Parent.any, "System this ISA object belongs to")
     pmu = Param.ArmPMU(NULL, "Performance Monitoring Unit")
diff --git i/src/arch/arm/ArmPMU.py w/src/arch/arm/ArmPMU.py
index 047e908b3..58553fbf9 100644
--- i/src/arch/arm/ArmPMU.py
+++ w/src/arch/arm/ArmPMU.py
@@ -40,6 +40,7 @@ from m5.params import *
 from m5.params import isNullPointer
 from m5.proxy import *
 from m5.objects.Gic import ArmInterruptPin
+from m5.util.fdthelper import *
 class ProbeEvent(object):
     def __init__(self, pmu, _eventId, obj, *listOfNames):
@@ -76,6 +77,17 @@ class ArmPMU(SimObject):
     _events = None
+    def generateDeviceTree(self, state):
+        node = FdtNode("pmu")
+        node.appendCompatible("arm,armv8-pmuv3")
+        # gem5 uses GIC controller interrupt notation, where PPI interrupts
+        # start to 16. However, the Linux kernel start from 0, and used a tag
+        # (set to 1) to indicate the PPI interrupt type.
+        node.append(FdtPropertyWords("interrupts", [
+            1, int(self.interrupt.num) - 16, 0xf04
+        ]))
+        yield node
+
     def addEvent(self, newObject):
         if not (isinstance(newObject, ProbeEvent)
             or isinstance(newObject, SoftwareIncrement)):
diff --git i/src/cpu/BaseCPU.py w/src/cpu/BaseCPU.py
index ab70d1d7f..66a49a038 100644
--- i/src/cpu/BaseCPU.py
+++ w/src/cpu/BaseCPU.py
@@ -302,6 +302,11 @@ class BaseCPU(ClockedObject):
             node.appendPhandle(phandle_key)
             cpus_node.append(node)
+        # Generate nodes from the BaseCPU children (and don't add them as
+        # subnode). Please note: this is mainly needed for the ISA class.
+        for child_node in self.recurseDeviceTree(state):
+            yield child_node
+
         yield cpus_node
     def __init__(self, **kwargs):
What the patch resolves
The Linux kernel uses a Device Tree Blob (DTB), a regular file, to describe the hardware on which it is running. This makes the kernel portable across different hardware without recompiling it for each hardware change. The DTB follows the Device Tree Reference and is compiled from a Device Tree Source (DTS) file, a regular text file. You can learn more here and here.
The PMU is supposed to be declared to the Linux kernel via the DTB. You can learn more here and here. In a simulated system, because the hardware is specified by the user, gem5 has to generate the DTB itself and pass it to the kernel, so that the kernel can recognize the simulated hardware. The problem was that gem5 did not generate a DTB entry for the PMU.
What the patch does
The patch adds an entry to the ISA and the CPU files so that DTB generation recurses through the object hierarchy until it reaches the PMU. The hierarchy is the following: CPU => ISA => PMU. It then adds a generation function to the PMU that produces a single DTB entry declaring the PMU, with the interrupt written in the notation the kernel expects.
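To make the CPU => ISA => PMU recursion easier to picture, here is a toy Python model of it. It does not use gem5: the classes Node, SimObj, PMU, ISA and CPU and the snake_case method names are hypothetical stand-ins for FdtNode, SimObject, generateDeviceTree and recurseDeviceTree, and the default "yield nothing" behavior is an assumption about how an object without its own generator behaves.

```python
class Node:
    """Toy stand-in for a device tree node (not gem5's FdtNode)."""
    def __init__(self, name):
        self.name = name


class SimObj:
    """Toy stand-in for a simulation object that owns child objects."""
    def __init__(self, *children):
        self.children = list(children)

    def recurse_device_tree(self, state):
        # Ask every child for its nodes and pass them up unchanged.
        for child in self.children:
            yield from child.generate_device_tree(state)

    def generate_device_tree(self, state):
        # Assumed default: contribute no node and do not visit children,
        # so anything below this object stays invisible.
        return iter(())


class PMU(SimObj):
    def generate_device_tree(self, state):
        yield Node("pmu")


class ISA(SimObj):
    # Mirrors the patch line `generateDeviceTree = SimObject.recurseDeviceTree`.
    generate_device_tree = SimObj.recurse_device_tree


class CPU(SimObj):
    def generate_device_tree(self, state):
        # Forward child nodes as siblings of "cpus", not as sub-nodes.
        yield from self.recurse_device_tree(state)
        yield Node("cpus")


cpu = CPU(ISA(PMU()))
names = [node.name for node in cpu.generate_device_tree(state=None)]
print(names)
```

If the ISA class kept the default generator instead of forwarding to its children, the list would contain only "cpus", which is the situation the patch fixes.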
After running a simulation with the patch, we can inspect the DTS decompiled from the DTB like this:
cd m5out
# Decompile the DTB to get the DTS.
dtc -I dtb -O dts system.dtb > system.dts
# Find the PMU entry (it sits near the top of the file).
head -n 30 system.dts
dtc is the Device Tree Compiler, installed with sudo apt-get install device-tree-compiler. We end up with this pmu DTB entry, under the root node (/):
/dts-v1/;

/ {
    #address-cells = <0x02>;
    #size-cells = <0x02>;
    interrupt-parent = <0x05>;
    compatible = "arm,vexpress";
    model = "V2P-CA15";
    arm,hbi = <0x00>;
    arm,vexpress,site = <0x0f>;

    memory@80000000 {
        device_type = "memory";
        reg = <0x00 0x80000000 0x01 0x00>;
    };

    pmu {
        compatible = "arm,armv8-pmuv3";
        interrupts = <0x01 0x04 0xf04>;
    };

    cpus {
        #address-cells = <0x01>;
        #size-cells = <0x00>;

        cpu@0 {
            device_type = "cpu";
            compatible = "gem5,arm-cpu";
[...]
In the line interrupts = <0x01 0x04 0xf04>;, 0x01 indicates that 0x04 is the number of a PPI interrupt (the one declared with number 20 in gem5; the offset of 16 is explained in the patch code). The value 0xf04 combines a flag (0x4) indicating an "active high level-sensitive" interrupt and a bit mask (0xf) indicating that the interrupt should be wired to all PEs attached to the GIC. You can learn more here.
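The decoding described above can be sketched in a few lines of Python. This is only an illustration of the three cells as explained in this answer: the function name decode_interrupts and the dictionary keys are hypothetical, and the split of 0xf04 into a low flag nibble and a mask starting at bit 8 is assumed from the 0x4 / 0xf description.

```python
import re

PPI_BASE = 16  # gem5 (GIC) numbering: PPIs start at 16
GIC_PPI = 1    # first cell: tag meaning "this is a PPI"


def decode_interrupts(dts_line):
    """Decode a three-cell `interrupts` property taken from a DTS line."""
    cells = [int(c, 16) for c in re.findall(r"0x[0-9a-fA-F]+", dts_line)]
    kind, number, flags = cells
    return {
        "is_ppi": kind == GIC_PPI,
        # Convert back to the number used in the gem5 script.
        "gem5_num": number + PPI_BASE if kind == GIC_PPI else None,
        "trigger": flags & 0xF,         # 0x4: active high, level-sensitive
        "pe_mask": (flags >> 8) & 0xFF  # 0xf: bit mask of target PEs
    }


info = decode_interrupts("interrupts = <0x01 0x04 0xf04>;")
print(info)
```

Running it on the line from the listing gives back PPI number 20, which is the value passed to ArmPPI in the simulation script.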
If the patch works and your ArmPMU is declared properly, you should see this message at boot time:
[ 0.239967] hw perfevents: enabled with armv8_pmuv3 PMU driver, 32 counters available
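If you want to check for this message automatically, a small Python sketch like the following could scan the captured boot log. The function name find_pmu_line is hypothetical, and where the log is stored depends on your setup (for example, a terminal output file in m5out is an assumption, not something this answer specifies).

```python
import re

# Pattern built from the boot message quoted above.
PMU_BOOT_RE = re.compile(
    r"hw perfevents: enabled with (?P<driver>\S+) PMU driver, "
    r"(?P<counters>\d+) counters available")


def find_pmu_line(log_text):
    """Return (driver, counter_count) if the kernel registered a PMU, else None."""
    match = PMU_BOOT_RE.search(log_text)
    if match is None:
        return None
    return match.group("driver"), int(match.group("counters"))


sample = ("[ 0.239967] hw perfevents: enabled with armv8_pmuv3 PMU driver, "
          "32 counters available")
print(find_pmu_line(sample))
```

A None result means the kernel never registered the PMU, which usually points back to a missing pmu node in the DTB.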
Answer 2:
Context
I was not able to use the Performance Monitoring Unit (PMU) because of a feature that gem5 had not implemented. The reference on the mailing list can be found here. After a personal patch, the PMU is accessible through perf_event. Fortunately, a similar patch will soon be part of an official gem5 release; it can be seen here. The patch is described in another answer, because of the limit on the number of links in a single post.
How to use the PMU
C source code
This is a minimal working example of C source code using perf_event to count the number of branches mispredicted by the branch predictor unit during a specific task:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(int argc, char **argv) {
    /* File descriptor used to read the mispredicted branches counter. */
    int perf_fd_branch_miss;

    /* Initialize our perf_event_attr, representing one counter to be read.
       Fields that are not set explicitly must be zero. */
    struct perf_event_attr attr_branch_miss;
    memset(&attr_branch_miss, 0, sizeof(attr_branch_miss));
    attr_branch_miss.size = sizeof(attr_branch_miss);
    attr_branch_miss.exclude_kernel = 1;
    attr_branch_miss.exclude_hv = 1;
    attr_branch_miss.exclude_callchain_kernel = 1;

    /* On a real system, you can do like this: */
    attr_branch_miss.type = PERF_TYPE_HARDWARE;
    attr_branch_miss.config = PERF_COUNT_HW_BRANCH_MISSES;
    /* On a gem5 system, you have to do like this. These two assignments
       override the two above: keep only the pair that matches your target. */
    attr_branch_miss.type = PERF_TYPE_RAW;
    attr_branch_miss.config = 0x10;

    /* Open the file descriptor corresponding to this counter (pid = 0 and
       cpu = -1: this process, on any CPU). The counter should start at this
       moment. */
    perf_fd_branch_miss = syscall(__NR_perf_event_open, &attr_branch_miss,
                                  0, -1, -1, 0);
    if (perf_fd_branch_miss == -1) {
        fprintf(stderr, "perf_event_open fail %d: %s\n", errno, strerror(errno));
        return 1;
    }

    /* Workload here, that means our specific task to profile. */

    /* Get and close the performance counter. */
    uint64_t counter_branch_miss = 0;
    if (read(perf_fd_branch_miss, &counter_branch_miss,
             sizeof(counter_branch_miss)) != sizeof(counter_branch_miss))
        fprintf(stderr, "read fail %d: %s\n", errno, strerror(errno));
    close(perf_fd_branch_miss);

    /* Display the result (PRIu64 is the portable format for uint64_t). */
    printf("Number of mispredicted branches: %" PRIu64 "\n", counter_branch_miss);
    return 0;
}
I will not go into the details of how to use perf_event; good resources are available here, here, here, here. However, here are a few notes about the code above:
- On real hardware, when using perf_event with common events (events that are available on many architectures), it is recommended to use the perf_event macro PERF_TYPE_HARDWARE as the type, and macros such as PERF_COUNT_HW_BRANCH_MISSES for the number of mispredicted branches, PERF_COUNT_HW_CACHE_MISSES for the number of cache misses, and so on (see the manual page for a list). This is a best practice for portable code.
- On a gem5 simulated system, currently (v20.0), C source code has to use the PERF_TYPE_RAW type and an architectural event ID to identify an event. Here, 0x10 is the ID of the 0x0010, BR_MIS_PRED, Mispredicted or not predicted branch event, described in the ARMv8-A Reference Manual (here). The manual describes all events available in real hardware; however, not all of them are implemented in gem5. To see the list of events implemented in gem5, refer to the src/arch/arm/ArmPMU.py file. In that file, the line self.addEvent(ProbeEvent(self,0x10, bpred, "Misses")) corresponds to the declaration of the counter described in the manual. Having to use raw IDs is not normal behavior, so gem5 should one day be patched to allow using PERF_TYPE_HARDWARE.
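The choice between the two notations can be summarized with a small Python sketch. It only computes the (type, config) pair that the C code stores in perf_event_attr; the helper attr_fields and the table GENERIC_TO_ARMV8 are hypothetical, and the numeric constants are assumed to match the enum values in <linux/perf_event.h>.

```python
# Values of enum perf_type_id (assumed from <linux/perf_event.h>).
PERF_TYPE_HARDWARE = 0
PERF_TYPE_RAW = 4

# Generic event number from enum perf_hw_id (same assumption).
PERF_COUNT_HW_BRANCH_MISSES = 5

# Architectural event ID used in the C example (BR_MIS_PRED).
ARMV8_BR_MIS_PRED = 0x10

# Hypothetical lookup table: generic perf event -> ARMv8 architectural ID.
# Extend it only with IDs that ArmPMU.py actually registers.
GENERIC_TO_ARMV8 = {PERF_COUNT_HW_BRANCH_MISSES: ARMV8_BR_MIS_PRED}


def attr_fields(generic_event, on_gem5):
    """Return the (type, config) pair to store in perf_event_attr."""
    if on_gem5:
        # gem5 v20.0: identify the event by its raw architectural ID.
        return PERF_TYPE_RAW, GENERIC_TO_ARMV8[generic_event]
    # Real hardware: the portable generic event is enough.
    return PERF_TYPE_HARDWARE, generic_event


print(attr_fields(PERF_COUNT_HW_BRANCH_MISSES, on_gem5=True))
print(attr_fields(PERF_COUNT_HW_BRANCH_MISSES, on_gem5=False))
```

Keeping the mapping in one table makes it easy to switch a program between real hardware and gem5 without touching the measurement code.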
gem5 simulation script
This is not an entire MWE script (it would be too long!), only the portion that needs to be added to a full-system script to use the PMU. We use an ArmSystem as the system, with the RealView platform.
For each ISA (we use an ARM ISA here) of each CPU (e.g., a DerivO3CPU) in our cluster (which is a SubSystem class), we add a PMU with a unique interrupt number and the already implemented architectural events. An example of such a function can be found in configs/example/arm/devices.py.
To choose an interrupt number, pick a free PPI interrupt in the platform interrupt mapping. Here, we choose PPI n°20, according to the RealView interrupt map (src/dev/arm/RealView.py). Since PPIs are local to each Processing Element (PE, which corresponds to a core in our context), the interrupt number can be the same for all PEs without any conflict. To know more about PPIs, see the GIC guide from ARM here.
Here, we can see that interrupt n°20 is not used by the system (from RealView.py):
Interrupts:
     0- 15: Software generated interrupts (SGIs)
    16- 31: On-chip private peripherals (PPIs)
        25   : vgic
        26   : generic_timer (hyp)
        27   : generic_timer (virt)
        28   : Reserved (Legacy FIQ)
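As a sanity check before hard-coding a number, the choice of a free PPI can be sketched in Python. The set of used PPIs below is copied from the excerpt above only, so it is an assumption that nothing else is taken; check the full map in RealView.py for your platform. The helper names free_ppis and check_ppi are hypothetical.

```python
PPI_RANGE = range(16, 32)  # PPIs occupy interrupt numbers 16 to 31

# PPIs listed as used in the excerpt above; the real file may list more.
USED_PPIS = {
    25: "vgic",
    26: "generic_timer (hyp)",
    27: "generic_timer (virt)",
    28: "Reserved (Legacy FIQ)",
}


def free_ppis(used):
    """Return the PPI numbers that do not appear in the used map."""
    return [num for num in PPI_RANGE if num not in used]


def check_ppi(num, used):
    """Return num if it is a free PPI, otherwise raise ValueError."""
    if num not in PPI_RANGE:
        raise ValueError("%d is not a PPI (expected 16 to 31)" % num)
    if num in used:
        raise ValueError("PPI %d is already used by %s" % (num, used[num]))
    return num


pmu_ppi = check_ppi(20, USED_PPIS)
print(pmu_ppi, free_ppis(USED_PPIS))
```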
We pass our system components (dtb, itb, etc.) to addArchEvents to link the PMU with them, so that the PMU uses the internal counters (called probes) of these components as the counters exposed to the system.
from m5.objects import ArmPMU, ArmPPI

for cpu in system.cpu_cluster.cpus:
    for isa in cpu.isa:
        isa.pmu = ArmPMU(interrupt=ArmPPI(num=20))
        # Add the implemented architectural events of gem5. We can
        # discover which events are implemented by looking at the file
        # "ArmPMU.py".
        isa.pmu.addArchEvents(
            cpu=cpu, dtb=cpu.dtb, itb=cpu.itb,
            icache=getattr(cpu, "icache", None),
            dcache=getattr(cpu, "dcache", None),
            l2cache=getattr(system.cpu_cluster, "l2", None))
Answer 3:
Two quick additions to Pierre's awesome answers:
- For fs.py, as of gem5 937241101fae2cd0755c43c33bab2537b47596a2, all that is missing is to apply the following to fs.py, as shown at: https://gem5-review.googlesource.com/c/public/gem5/+/37978/1/configs/example/fs.py
for cpu in test_sys.cpu:
    if buildEnv['TARGET_ISA'] in "arm":
        for isa in cpu.isa:
            isa.pmu = ArmPMU(interrupt=ArmPPI(num=20))
            isa.pmu.addArchEvents(
                cpu=cpu, dtb=cpu.mmu.dtb, itb=cpu.mmu.itb,
                icache=getattr(cpu, "icache", None),
                dcache=getattr(cpu, "dcache", None),
                l2cache=getattr(test_sys, "l2", None))
- A C example can also be found in man perf_event_open.
Source: https://stackoverflow.com/questions/63988672/using-perf-event-with-the-arm-pmu-inside-gem5