Question
I haven't found a clear answer: does the control unit itself fetch pre-defined instructions to execute a cache eviction, or does the operating system intervene? If so, how?
Answer 1:
Which part of the computer manages cache replacement?
Typically, a cache manages cache replacement itself (it's not done by a separate part).
There are many types of caches: some are implemented in software (a DNS cache, a web page cache, a file data cache) and some are implemented in hardware (instruction caches, data caches, translation look-aside buffers).
In all cases, whenever new data needs to be inserted into the cache and there isn't enough space, other data needs to be evicted quickly to make space for the new data. Ideally, the data "least likely to be needed soon" should be evicted, but that's too hard to determine, so most caches make the (potentially incorrect) assumption that "least recently used" is a good predictor of "least likely to be needed soon".
Typically this means storing some kind of "time when last used" along with the data for each item in the cache; which means (for performance) the "least recently used" tracking, and the eviction itself, is typically built directly into the design of the cache (e.g. the "time when last used" information is stored in a "cache tag" along with other meta-data).
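As a rough software sketch of that idea (not how real SRAM tag arrays are built; the 4-way associativity, the per-set counter, and all the names are just assumptions for illustration), a set with a "time when last used" field per line and true-LRU eviction might look like:

    #include <stdint.h>
    #include <stdbool.h>

    #define WAYS 4  /* assumed associativity of one cache set, for illustration */

    /* Hypothetical per-line metadata (the "cache tag" plus other meta-data). */
    struct line {
        bool     valid;
        uint64_t tag;        /* which memory block is stored in this way */
        uint64_t last_used;  /* "time when last used", here just a counter value */
    };

    struct set {
        struct line way[WAYS];
        uint64_t    clock;   /* per-set access counter standing in for time */
    };

    /* Look up 'tag' in the set; on a miss, evict the least-recently-used way. */
    static int access_set(struct set *s, uint64_t tag)
    {
        s->clock++;

        /* Hit: refresh the "time when last used" for that line. */
        for (int w = 0; w < WAYS; w++) {
            if (s->way[w].valid && s->way[w].tag == tag) {
                s->way[w].last_used = s->clock;
                return w;
            }
        }

        /* Miss: prefer an invalid way, otherwise the smallest last_used value. */
        int victim = -1;
        for (int w = 0; w < WAYS; w++) {
            if (!s->way[w].valid) { victim = w; break; }
        }
        if (victim < 0) {
            victim = 0;
            for (int w = 1; w < WAYS; w++)
                if (s->way[w].last_used < s->way[victim].last_used)
                    victim = w;
        }
        s->way[victim].valid = true;
        s->way[victim].tag = tag;
        s->way[victim].last_used = s->clock;
        return victim;
    }

Note that the metadata gets updated on every hit, not just on misses; that's the part that has to be cheap, and it's why the state lives right next to the tags the cache is already checking.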
Answer 2:
Hardware caches manage their own replacement, typically with a pseudo-LRU approach to choosing which way of a set to evict. (True LRU would take too many bits of state, especially for 8-way or higher associativity.) See also http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/ : large, slower caches (like the L3 cache in modern Intel CPUs) may use an adaptive replacement policy to try to keep some valuable lines even when there are tons of cache misses from a huge working set that doesn't have much future value.
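To get a sense of the cost difference: true LRU for an 8-way set has to distinguish 8! = 40320 recency orderings, which needs at least 16 bits of state per set (practical encodings use even more), while tree pseudo-LRU gets by with 7 bits per set, one per node of a binary tree over the ways. Here is a minimal sketch of tree-PLRU for one set; the 8-way associativity and the bit convention are assumptions for illustration, not any particular CPU's implementation:

    #include <stdint.h>

    #define WAYS 8               /* assumed associativity for this sketch */
    #define NODES (WAYS - 1)     /* 7 internal tree nodes -> 7 bits of state */

    /* One bit per tree node.  Convention here: 0 = "victim is in the left
       subtree", 1 = "victim is in the right subtree". */
    struct plru { uint8_t bit[NODES]; };

    /* Walk the tree in the direction the bits point to pick a victim way. */
    static int plru_victim(const struct plru *p)
    {
        int node = 0;
        while (node < NODES)
            node = 2 * node + 1 + p->bit[node];  /* left child = 2n+1, right = 2n+2 */
        return node - NODES;                     /* leaves map to ways 0..7 */
    }

    /* On every hit (and on the fill after a miss), set the bits on the path
       to the touched way so they point AWAY from it. */
    static void plru_touch(struct plru *p, int way)
    {
        int node = 0;
        for (int level_size = WAYS / 2; level_size >= 1; level_size /= 2) {
            int go_right = (way / level_size) & 1;  /* which subtree holds 'way' */
            p->bit[node] = !go_right;               /* point at the other subtree */
            node = 2 * node + 1 + go_right;
        }
    }

On a miss you'd call plru_victim() to choose the way to replace and then plru_touch() on the filled way; on a hit you'd just call plru_touch(). Each operation is a handful of bit flips per set, exactly the kind of thing that's cheap to wire directly into the tag-check hardware and hopeless to do by trapping to software.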
If we consider what it might look like for an OS to have a hand in managing the hardware caches, we quickly see how insane it would be just to implement at all (can the handler access memory? What if its own accesses need to replace a line in a set?), and that performance would be a disaster on top of the implementation complexity. From this reasoning we can see why dedicated logic gates are built right into the same hardware that checks and updates the cache.
Trapping to the OS on every cache miss would make cache misses much more costly. Some workloads trigger a lot of cache replacement, e.g. looping over large arrays where most accesses miss in at least the first-level cache (if you aren't doing enough computation for HW prefetch to stay ahead). It would also hurt memory-level parallelism (multiple cache misses in flight at once), which is very important for hiding the large memory latency. I guess if you just choose a line to evict, the handler could return without actually waiting for the cache miss itself to resolve, so you could possibly have it run again while another cache miss was still in flight. But memory-ordering rules would make this sketchy: for example, some ISAs guarantee that loads will appear to have happened in program order.
Trapping to an OS's handler would flush the pipeline on most normal CPUs.
Also, HW prefetch: it's important for hardware to be able to speculatively read ahead of where a stream of loads is currently reading, so that when the actual demand load happens it can hopefully hit in L2 or even L1d cache. (If replacement in the real cache had to be managed by the OS, you'd need some separate prefetch buffer for this that the OS could read from? Insane levels of complexity if you want prefetching to keep working, and it's essential for performance on modern CPUs.)
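As a purely illustrative toy model of what "reading ahead of a stream of loads" means (real prefetchers track multiple streams, strides, and confidence, and none of these names correspond to a real hardware interface), a next-line prefetcher could be sketched like this:

    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_SIZE 64   /* assumed cache-line size in bytes */

    /* Toy model of a next-line prefetcher watching one stream of demand loads. */
    struct prefetcher {
        uint64_t last_line;   /* line address of the previous demand access */
        bool     valid;
    };

    /* Hypothetical hook the cache would invoke on each demand access.
       Returns the byte address of a line worth prefetching, or 0 if the
       accesses don't look like a sequential stream. */
    static uint64_t on_demand_access(struct prefetcher *pf, uint64_t addr)
    {
        uint64_t line = addr / LINE_SIZE;
        uint64_t prefetch_addr = 0;
        if (pf->valid && line == pf->last_line + 1)
            prefetch_addr = (line + 1) * LINE_SIZE;   /* read ahead of the stream */
        pf->last_line = line;
        pf->valid = true;
        return prefetch_addr;
    }

The point is that the prefetcher's fills go into (and evict from) the same cache as demand accesses, so whatever manages replacement has to handle them at hardware speed too.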
Besides, what's the OS going to do? Run instructions that load data to figure out which line to replace? What if those loads/stores create more cache misses?
Also: in an out-of-order exec CPU, stores don't truly commit to L1d cache until after they retire from the out-of-order back end, i.e. until after they're known to be non-speculative. (The store buffer is what allows this decoupling.) At that point there's no way to roll them back; they definitely need to happen. If you have multiple cache-miss stores in the store buffer before you detect the first one (or a cache-miss load happens synchronously), how could a hypothetical cache-miss exception handler do anything without violating the memory model, if the ISA requires store ordering? This seems like a nightmare.
I've been assuming that a "cache miss handler" would be something like a software TLB miss handler (e.g. on MIPS or another ISA that doesn't do hardware page-walks). (In MIPS, the TLB miss exception handler must use memory in a special region that has a fixed translation, so it can be accessed without itself causing more TLB misses.) The only thing that could make any sense would be for the OS to provide some kind of "microcode" that implements a replacement policy, and the CPU runs it internally when replacement is needed, not in sequence with normal execution of instructions for the main CPU.
But in practice programmable microcode would be way too inefficient; it wouldn't have time to check memory or anything (unless there was persistent cache-speed state reserved for use by this microcode). Dedicated hardware can make a decision in a clock cycle or two, with logic wired up directly to the state bits for that cache.
The choice of what state to provide and track is strongly tied to the choice of replacement algorithm, so making the policy programmable would only make sense if the hardware also offered a wider choice of algorithms, or a lot of generic state for them to use.
LRU requires updating the state tracking on every cache hit. Trapping to the OS to let it choose how to update things on every cache hit is obviously not plausible for acceptable performance; every memory access would trap.
Source: https://stackoverflow.com/questions/65366046/which-part-of-the-computer-manages-cache-replacement