Which system characteristics, such as the number of cores or the cache configuration, can I change when restoring a checkpoint in gem5?

Posted by 血红的双手 on 2021-02-11 14:47:23

Question


This is an important non-trivial question that comes up from time to time, so I would like to centralize its discussion.

If some point needs further discussion, let's ask about it specifically on a separate question.

Related mailing list posts:

  • https://www.mail-archive.com/gem5-users@gem5.org/msg17206.html
  • https://www.mail-archive.com/gem5-users@gem5.org/msg17615.html

Answer 1:


What you always have to keep in mind is: some software is running (e.g. the Linux kernel), and it might be storing state in memory that describes the hardware.

Therefore, if I make a sudden change to the underlying hardware during restore, could that cause the software to blow up because it was expecting different hardware, based on information it previously gathered into memory or registers?

As a general principle, the more "microarchitectural" something is, the less the software running is likely to see it and blow up due to it changing.

So to more specifically address the most common cases:

  • CPU type: CPU types such as AtomicSimpleCPU, MinorCPU and DerivO3CPU are basically microarchitectural descriptions, and switching between them is well supported. There are even pre-commit tests that assert that this functionality works: search for the switcheroo tests under tests/config, e.g. in gem5 5ae5fa85d7eb51f4dafdef7e27316d6fc84dedc1.

  • caches: gem5's classic memory system does not save any cache state, so the user is not tied to a predetermined cache hierarchy configuration when restoring a checkpoint. Also, when creating checkpoints the simulation must be run without caches, so the simulator can skip in-depth cache handling. Therefore, when restoring checkpoints, any combination of cache sizes, levels and connections is possible. However, since the caches will be restored empty, it is advisable to let the simulation warm up before stats start to be collected.

    Furthermore, cache sizes currently don't even appear to be exposed to the guest at all: Why doesn't the Linux kernel see the cache sizes in the gem5 emulator in full system mode? So that is one less thing that can go wrong. If they were exposed, software that tunes itself to the cache size could be tuning based on a previously read value and run slower than expected; you would have to understand that software and ensure this does not happen, i.e. make sure it re-reads the cache sizes after the restore.

  • number of CPUs: I'm pretty sure that the Linux kernel checks the number of CPUs and initializes them early on, and so your software won't be able to use any extra CPUs you add. For example, an aarch64 Linux 5.4.3 boot logs the secondary core initialization relatively early in the boot:

    <6>[    0.051463] smp: Bringing up secondary CPUs ...
    <6>[    0.055387] Detected PIPT I-cache on CPU1
    <6>[    0.056322] CPU1: Booted secondary processor 0x0000000001 [0x000f0510]
    <6>[    0.062014] Detected PIPT I-cache on CPU2
    <6>[    0.062172] CPU2: Booted secondary processor 0x0000000002 [0x000f0510]
    <6>[    0.065890] Detected PIPT I-cache on CPU3
    <6>[    0.066051] CPU3: Booted secondary processor 0x0000000003 [0x000f0510]
    <6>[    0.066689] smp: Brought up 1 node, 4 CPUs
    <6>[    0.066771] SMP: Total of 4 processors activated.
    

    I'm not sure if gem5 itself can handle adding more cores, but I ran a simple example and it did not blow up immediately. So maybe if you were able to force the kernel to re-check for CPUs, it would work.

    I would also look into CPU-hotplugging capabilities which the kernel definitely has, but which I would bet gem5 does not implement. If everything were perfectly aligned, it would be in theory possible to have a smart restore mechanism that calls hotplug mechanisms at restore time.
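To make the safe and unsafe changes above concrete, here is a minimal sketch of the restore side of a gem5 Python config script, assuming the classic memory system. The class names (DerivO3CPU, Cache) and the m5.instantiate(ckpt_dir=...) restore entry point are gem5's real API, but `system` and `num_cpus` stand in for the rest of the script, and the fragment is only meaningful when run inside gem5:

```python
# Sketch of the restore side of a gem5 config script (simplified; the m5
# package is only available inside gem5). The system must be rebuilt as in
# the checkpoint run, but the items below may legitimately differ.
import m5
from m5.objects import DerivO3CPU, Cache

# 1) A different CPU model is fine: it is purely microarchitectural.
#    (The checkpoint may well have been taken with AtomicSimpleCPU.)
system.cpu = [DerivO3CPU(cpu_id=i) for i in range(num_cpus)]

# 2) Caches may be added or resized freely: classic caches hold no
#    checkpointed state and come up empty (so warm up before taking stats).
for cpu in system.cpu:
    cpu.icache = Cache(size='32kB', assoc=2,
                       tag_latency=2, data_latency=2, response_latency=2,
                       mshrs=4, tgts_per_mshr=20)

# Restore from the checkpoint directory instead of cold-booting:
m5.instantiate(ckpt_dir='m5out/cpt.1000000000')
m5.simulate()
```

Changing `num_cpus` itself, by contrast, is exactly the case discussed above: gem5 may tolerate it, but the guest kernel will not use the extra cores without being forced to re-scan for them.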

As a related issue, I have heard that certain setups don't support the taking of checkpoints because they can't properly drain state: this was the case for one of the Ruby protocols, but I don't remember which right now.

Performance counters are another slightly interesting case that comes to mind as a way to leak microarchitecture, but generally software won't blow up due to unexpected values of performance counters, and those counters are meant to be reset before the region of interest anyway.

As a rule of thumb, when in doubt whether a simulation object can be changed, look within its code (and its base classes') for the overload of the serialize() function. This function and its unserialize() counterpart determine which architectural state is saved and restored when a checkpoint is taken.
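As a concrete illustration of that rule of thumb: gem5 writes checkpoints as a plain INI file (m5.cpt), where every SimObject whose serialize() emits state gets its own section. The following self-contained Python sketch parses a mock checkpoint in that format; the section and key names below are invented for illustration, not copied from a real checkpoint:

```python
import configparser

# Mock m5.cpt excerpt (illustrative names only): CPU architectural state and
# the memory image are serialized, but classic caches emit nothing.
mock_cpt = """\
[system.cpu]
instCnt=123456

[system.cpu.xc.0]
pc=0xffffff8008080000

[system.physmem.store0]
filename=system.physmem.store0.pmem
"""

cpt = configparser.ConfigParser()
cpt.read_string(mock_cpt)

sections = cpt.sections()
# No section mentions a cache: since classic caches serialize no state,
# the cache hierarchy is free to change across a restore.
has_cache_state = any('cache' in s for s in sections)
print(sections)          # three sections, none of them a cache
print(has_cache_state)   # False
```

The same inspection works on a real m5.cpt: grepping its section headers shows exactly which objects pinned their state into the checkpoint, and therefore which parts of the configuration must stay fixed on restore.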



Source: https://stackoverflow.com/questions/60876259/which-system-characteristics-such-as-number-of-cores-of-cache-configurations-can
