Question
This question has an answer that says:
Hyper-threading duplicates internal resources to reduce context switch time. Resources can be: Registers, arithmetic unit, cache.
Why did CPU designers end up with duplication of state resources for simultaneous multithreading (or hyper-threading on Intel)?
Why wouldn't tripling (quadrupling, and so on) those same resources give us three logical cores and, therefore, even faster throughput?
Is duplication that researchers arrived at in some sense optimal, or is it just a reflection of current possibilities (transistor size, etc.)?
Answer 1:
The answer you're quoting sounds wrong. Hyperthreading competitively shares the existing ALUs, cache, and physical register file.
Running two threads at once on the same core lets it find more parallelism to keep those execution units fed with work instead of sitting idle waiting for cache misses, latency, and branch mispredictions.
Only a few things need to be physically replicated or partitioned to track the architectural state of two CPUs in one core, and it's mostly in the front-end (before the issue/rename stage). David Kanter's Haswell write-up shows how Sandy Bridge always partitioned the IDQ (the decoded-uop queue that feeds the issue/rename stage), but Ivy Bridge and Haswell can use it as one big queue when only a single thread is active. He also describes how cache is competitively shared between threads.

For example, a Haswell core has 168 physical integer registers, but the architectural state of each logical CPU only needs 16. (Out-of-order execution for each thread of course benefits from lots of registers; that's why register renaming onto a big physical register file is done in the first place.)
Modern Intel CPUs have so many execution units that you can only barely saturate them with carefully tuned code that doesn't have any stalls and runs 4 fused-domain uops per clock. This is very rare in practice, outside something like a matrix multiply in a hand-tuned BLAS library.
Most code benefits from HT because it can't saturate a full core on its own, so the existing resources of a single core can run two threads, each at faster than half speed (usually significantly faster than half).
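As a concrete illustration of the kind of code that can't saturate a core on its own, here is a minimal micro-benchmark sketch (hypothetical code, not from the answer; sizes and layout are arbitrary): a dependent pointer-chasing loop that spends nearly all its time waiting on cache/memory latency. Two copies of it pinned to the two logical CPUs of one physical core would each run at close to full speed, because neither comes close to using the core's execution resources.

```c
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 21)   /* ~2M nodes (~128 MiB): far larger than any cache level */

struct node { struct node *next; long pad[7]; };  /* one 64-byte cache line per node */

int main(void) {
    struct node *nodes = malloc((size_t)N * sizeof *nodes);
    size_t *order = malloc((size_t)N * sizeof *order);
    if (!nodes || !order) return 1;

    for (size_t i = 0; i < N; i++) order[i] = i;
    for (size_t i = N - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i + 1 < N; i++)            /* link the nodes in shuffled order */
        nodes[order[i]].next = &nodes[order[i + 1]];
    nodes[order[N - 1]].next = NULL;

    /* One long dependency chain: each load's address depends on the previous
     * load, so out-of-order execution can't overlap much work and the core's
     * execution units sit mostly idle, waiting on memory latency. */
    long count = 0;
    for (struct node *p = &nodes[order[0]]; p; p = p->next)
        count++;

    printf("visited %ld nodes\n", count);
    free(order);
    free(nodes);
    return 0;
}
```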
But when only a single thread is running, the full power of a big core is available for that thread. This is what you lose out on if you design a multicore CPU that has lots of small cores. If Intel CPUs didn't implement hyperthreading, they would probably not include quite so many execution units for a single thread. That extra width helps for a few single-thread workloads, but helps a lot more with HT. So you could argue that it is a case of replicating ALUs because the design supports HT, but it's not essential.
Pentium 4 didn't really have enough execution resources to run two full threads without losing more than you gained. Part of this might be the trace cache, but it also didn't have nearly as many execution units. P4 with HT made it useful to use prefetch threads that do nothing but prefetch data from an array the main thread is looping over, as described/recommended in What Every Programmer Should Know About Memory (which is otherwise still useful and relevant). A prefetch thread has a small trace-cache footprint and fetches into the L1D cache used by the main thread. This is what happens when you implement HT without enough execution resources to really make it good.
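A rough sketch of that prefetch-thread idea, using pthreads, C11 atomics, and GCC's `__builtin_prefetch` (hypothetical code, not taken from the book or the answer). For it to help, both threads would have to be pinned to the two logical CPUs of the same physical core, e.g. with `pthread_setaffinity_np`, so they share the L1D; pinning is omitted here for brevity.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define N (16u * 1024 * 1024)   /* 64 MiB of floats: far larger than L2 */
#define AHEAD 2048              /* how many elements to run ahead of the main loop */

static float *data;
static _Atomic size_t progress; /* index the main thread has reached */

/* Helper thread: does nothing but touch cache lines ahead of the main
 * thread's position, so they are already in cache when it gets there.
 * Tiny front-end footprint, but it warms the shared L1D/L2. */
static void *prefetch_thread(void *arg) {
    (void)arg;
    for (;;) {
        size_t i = atomic_load_explicit(&progress, memory_order_relaxed);
        if (i >= N) break;
        size_t stop = (i + AHEAD < N) ? i + AHEAD : N;
        for (size_t j = i; j < stop; j += 16)   /* one touch per 64-byte line */
            __builtin_prefetch(&data[j], 0, 1);
    }
    return NULL;
}

int main(void) {
    data = calloc(N, sizeof *data);
    if (!data) return 1;

    pthread_t helper;
    pthread_create(&helper, NULL, prefetch_thread, NULL);

    double sum = 0.0;
    for (size_t i = 0; i < N; i++) {            /* the "real" work loop */
        sum += data[i];
        if ((i & 1023) == 0)                    /* publish progress occasionally */
            atomic_store_explicit(&progress, i, memory_order_relaxed);
    }
    atomic_store_explicit(&progress, N, memory_order_relaxed);  /* let the helper exit */

    pthread_join(helper, NULL);
    printf("sum = %f\n", sum);
    free(data);
    return 0;
}
```

On a modern wide core this trick is rarely worth it; it only made sense because a P4 hyperthread was otherwise so cheap to occupy.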
HT doesn't help at all for code that bottlenecks on a core's peak FMA throughput or something (keeping 10 FMAs in flight with 10 vector accumulators). It can even hurt for code that ends up slowing down a lot from extra cache misses caused by competing for space in the L1D and L2 caches with another thread. (And also the uop cache and L1I cache).
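For contrast, here is a sketch of that FMA-bound case (hypothetical code, assuming AVX+FMA and something like gcc -O3 -mavx2 -mfma; the constants follow Haswell's ~5-cycle FMA latency and 2-per-clock throughput mentioned in the parenthetical). Ten independent vector accumulators keep ten FMAs in flight, so one thread already saturates the FMA units and a sibling hyperthread has nothing left to use.

```c
#include <immintrin.h>
#include <stdio.h>

#define VEC  8                         /* floats per __m256 */
#define ACCS 10                        /* ~ FMA latency (5 cycles) * throughput (2/clock) */
#define LEN  (ACCS * VEC * 16)         /* 1280 floats: stays resident in L1D */
#define REPS 1000000

static float a[LEN];

int main(void) {
    for (int i = 0; i < LEN; i++) a[i] = 1.0f;

    const __m256 scale = _mm256_set1_ps(0.5f);
    __m256 acc[ACCS];
    for (int k = 0; k < ACCS; k++) acc[k] = _mm256_setzero_ps();

    /* ACCS independent dependency chains, one load + one FMA each:
     * enough FMAs in flight to keep both FMA units busy every cycle,
     * so a single thread already hits the core's peak FMA throughput. */
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < LEN; i += ACCS * VEC)
            for (int k = 0; k < ACCS; k++)
                acc[k] = _mm256_fmadd_ps(scale,
                                         _mm256_loadu_ps(&a[i + VEC * k]),
                                         acc[k]);

    __m256 total = acc[0];             /* reduce the accumulators at the end */
    for (int k = 1; k < ACCS; k++) total = _mm256_add_ps(total, acc[k]);

    float lane[VEC];
    _mm256_storeu_ps(lane, total);
    float sum = 0.0f;
    for (int k = 0; k < VEC; k++) sum += lane[k];

    printf("sum = %f\n", sum);         /* keep the result live so the loop isn't optimized away */
    return 0;
}
```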
Agner Fog's microarch pdf says the same thing.
Paul Clayton's comments on the question also make some good points about SMT designs in general.
Source: https://stackoverflow.com/questions/35748305/is-duplication-of-state-resources-considered-optimal-for-hyper-threading