Disable CPU caches (L1/L2) on ARMv8-A Linux

Question

I want to disable the low-level caches on an ARMv8-A platform running Linux, in order to measure the performance of optimized code independent of cache access.

For Intel systems I found the following resource (Is there a way to disable CPU cache (L1/L2) on a Linux system?), but it cannot be applied directly because of the different instruction set.

So far I have a kernel module which alters the corresponding system register to disable the instruction and data caches.

#include <linux/module.h>
#include <linux/types.h>

int init_module(void)
{
  u64 value;

  /* value is an output only: MRS fills it before anything reads it */
  asm volatile("\
    MRS %0, SCTLR_EL1     // Read SCTLR_EL1 into Xt\n\
    BIC %0, %0, (1<<2)    // clear bit 2, SCTLR_EL1.C\n\
    BIC %0, %0, (1<<12)   // clear bit 12, SCTLR_EL1.I\n\
    MSR SCTLR_EL1, %0     // Write Xt to SCTLR_EL1\n\
    ISB                   // synchronize the system register write\n\
  " : "=r" (value) : : "memory");

  return 0;
}

void cleanup_module(void)
{
  u64 value;

  asm volatile("\
    MRS %0, SCTLR_EL1    // Read SCTLR_EL1 into Xt\n\
    ORR %0, %0, (1<<2)   // set bit 2, SCTLR_EL1.C\n\
    ORR %0, %0, (1<<12)  // set bit 12, SCTLR_EL1.I\n\
    MSR SCTLR_EL1, %0    // Write Xt to SCTLR_EL1\n\
    ISB                  // synchronize the system register write\n\
  " : "=r" (value) : : "memory");
}

MODULE_LICENSE("GPL");

However, it results in a complete system freeze when loaded (as soon as the corresponding bits in the system register are changed). My guess is that I still need some kind of cache clear, but I didn't find anything useful in the ARM manuals.

Does anyone have any hints on how I could succeed in disabling the caches on ARM, or on what I am missing here? Thanks.

Answer 1:

In general, this is unworkable, for several reasons.

Firstly, clearing the SCTLR.C bit only makes all data accesses non-cacheable and prevents allocating into any caches. Any data already in the caches is still there, especially dirty lines from anything recently written; consider what happens when your function returns and the caller tries to restore a stack frame which doesn't even exist in the memory it's now accessing.
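To make the bits in question concrete, here is a minimal user-space sketch of the mask arithmetic the module performs. It works on a plain integer copy of the register value, not on SCTLR_EL1 itself (which is only accessible at EL1 or above); the helper names are hypothetical and the bit positions are the ones used in the question's module. Note that nothing in this update touches the contents of the caches, which is exactly the problem described above.

```c
#include <stdint.h>

/* Bit positions in SCTLR_EL1, as used by the module in the question. */
#define SCTLR_EL1_C (UINT64_C(1) << 2)   /* data cacheability control */
#define SCTLR_EL1_I (UINT64_C(1) << 12)  /* instruction cacheability control */

/* Hypothetical helper: value that would be written back to disable caching.
 * All other bits (e.g. the MMU enable in bit 0) are left untouched. */
static inline uint64_t sctlr_caches_off(uint64_t sctlr)
{
    return sctlr & ~(SCTLR_EL1_C | SCTLR_EL1_I);
}

/* Hypothetical helper: value that would be written back to re-enable caching. */
static inline uint64_t sctlr_caches_on(uint64_t sctlr)
{
    return sctlr | SCTLR_EL1_C | SCTLR_EL1_I;
}
```

For a made-up register value of 0x1005 (bits 0, 2 and 12 set), sctlr_caches_off() yields 0x1 and sctlr_caches_on() gives back 0x1005.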

Secondly, there are very few uniprocessor ARMv8 systems; assuming you're running SMP Linux, and suddenly disable the caches on just whichever CPU the module loader happened to be scheduled on, then even disregarding the first point things are going to go downhill very fast. Linux expects all CPUs to be coherent with each other, and will typically become very broken very rapidly if that assumption is violated. Note that it's not even worth venturing into SMP cross-calling for this; suffice to say the only safe way to even attempt to run Linux with caches disabled is to make sure they are never enabled to begin with, except...

Thirdly, there is no guarantee Linux will even run with caches disabled. On current hardware, all of the locking and atomic operations in the kernel (not to mention userspace) rely on the exclusive access instructions. Whilst the CPU cluster(s) will implement the architecturally-required local and global exclusive monitors for cacheable memory (usually as part of the cache machinery itself), it is dependent on the system whether a global exclusive monitor for non-cacheable accesses is implemented, as such a thing must be external to the CPU (usually in the interconnect or memory controller). Many systems don't implement such a global monitor, in which case exclusive accesses to external memory may fault, do nothing, or exhibit various other implementation-defined behaviours which will result in Linux crashing or deadlocking. It is effectively impossible to run Linux with the cache off on such a system - the amount of hacking just to get a UP arm64 kernel to work (SMP would be literally impossible) would be impractical alone; good luck with userspace.
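To illustrate how pervasive this dependency is, here is a minimal sketch of a lock built on C11 atomics. It is not taken from the kernel; the type and function names are hypothetical. On ARMv8.0 targets without the later atomic instruction extensions, compilers typically implement the test-and-set below as a load-exclusive/store-exclusive loop, so even something this small depends on a working exclusive monitor for the memory it lives in.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical minimal spinlock, for illustration only. */
typedef struct {
    atomic_flag locked;
} demo_spinlock;

static inline void demo_lock_init(demo_spinlock *l)
{
    atomic_flag_clear(&l->locked);
}

/* Returns true if the lock was acquired. test-and-set returns the previous
 * state of the flag, so "previously clear" means we now own the lock. */
static inline bool demo_try_lock(demo_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

static inline void demo_unlock(demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

If the store-exclusive can never succeed (or faults) because no monitor covers non-cacheable memory, a loop around demo_try_lock() never makes progress, which is the deadlock scenario described above.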

As it happens, though, the worst problem is none of that, it's this:

...in order to measure performance of optimized code, independent of cache access.

If the code is intended to run in deployment with caches disabled, then logically it can't be intended to run under Linux, therefore the effort spent in hacking up Linux would be better spent on benchmarking in a more realistic execution environment so that results are actually representative. On the other hand, if it is intended to run with caches enabled (under Linux or any other OS), then benchmarking with caches disabled will give meaningless results and be a waste of time. "Optimising" for e.g. an instruction-fetch-bandwidth bottleneck which won't exist in practice is not going to lead you in the right direction.

Answer 2:

I have done this on ARMv8-A Linux. I did it not to measure performance but to verify whether the Xilinx ZCU104 platform might have a coherence error. The result was that the PYNQ image Xilinx provides must have some coherence error in the communication between PL and PS. Here is my workaround:

My platform is a Cortex-A53 running Ubuntu 18, which starts at EL2, switches to EL1, and supports SMP on four CPU cores. Thus, I needed to turn off the other cores to ensure L2 cache coherency. Thanks to CPU hotplug, I just ran:

echo '0' > /sys/devices/system/cpu/cpu1/online

echo '0' > /sys/devices/system/cpu/cpu2/online

echo '0' > /sys/devices/system/cpu/cpu3/online

Then I ran dmesg to verify that the other cores had been turned off.
I built the kernel source tree, because I could not find it on my system. You can run uname -r to see your kernel version, and look in /usr/src to see whether your system already has it.
I built the kernel module. Using GCC inline asm, it flushes all caches and sets SCTLR_EL1.C to 0.
I loaded the module with insmod. I then ran my program and got the right result, although it was 20 times slower than with multi-core and the D-cache on.
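As an alternative to reading dmesg, the hotplug step can be checked from a program. The sketch below is not part of the original answer; it assumes the standard sysfs file /sys/devices/system/cpu/online, which lists online CPUs in the kernel's "cpulist" format (for example "0" or "0-3"), and the function names are hypothetical.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Count the CPUs named in a cpulist string such as "0", "0-3" or "0,2-3".
 * Returns -1 if the string is malformed. */
static inline int count_cpus_in_list(const char *s)
{
    int count = 0;

    while (*s && *s != '\n') {
        char *end;
        long first, last;

        if (!isdigit((unsigned char)*s))
            return -1;
        first = strtol(s, &end, 10);
        last = first;
        s = end;
        if (*s == '-') {                 /* a range "first-last" */
            s++;
            if (!isdigit((unsigned char)*s))
                return -1;
            last = strtol(s, &end, 10);
            s = end;
        }
        if (last < first)
            return -1;
        count += (int)(last - first + 1);
        if (*s == ',')                   /* another entry follows */
            s++;
        else if (*s && *s != '\n')
            return -1;
    }
    return count;
}

/* Read the online CPU list from sysfs; returns -1 if it cannot be read. */
static inline int count_online_cpus(void)
{
    char buf[256];
    FILE *f = fopen("/sys/devices/system/cpu/online", "r");

    if (!f)
        return -1;
    if (!fgets(buf, sizeof buf, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return count_cpus_in_list(buf);
}
```

After the three echo commands above, count_online_cpus() would be expected to return 1 on a four-core system, since only cpu0 remains online.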

Source: https://stackoverflow.com/questions/41227527/disable-cpu-caches-l1-l2-on-armv8-a-linux

Tags

Linux

caching

arm