Using SIMD on amd64, when is it better to use more instructions vs. loading from memory?

前端 未结 2 1471
佛祖请我去吃肉
佛祖请我去吃肉 2021-01-17 12:14

I have some highly perf sensitive code. A SIMD implementation using SSEn and AVX uses about 30 instructions, while a version that uses a 4096 byte lookup table uses about 8

相关标签:
2条回答
  • 2021-01-17 12:44

    Look-up tables are rarely a performance win in real-world code, especially when they're as large as 4k bytes. Modern processors can execute computations so quickly that it is almost always faster to just do the computations as needed, rather than trying to cache them in a look-up table. The only exception to this is when the computations are prohibitively expensive. That's clearly not the case here, when you're talking about a difference of 30 vs. 8 instructions.

    The reason your micro-benchmark is suggesting that the LUT-based approach is faster is because the entire LUT is getting loaded into cache and never evicted. This makes its usage effectively free, such that you are comparing between executing 8 and 30 instructions. Well, you can guess which one will be faster. :-) In fact, you did guess this, and proved it with explicit cache invalidation.

    In real-world code, unless you're dealing with a very short, tight loop, the LUT will inevitably be evicted from the cache (especially if it's as large as this one is, or if you execute a lot of code in between calls to the code being optimized), and you'll pay the penalty of re-loading it. You don't appear to have enough operations that need to be performed concurrently such that this penalty can be mitigated with speculative loads.

    The other hidden cost of (large) LUTs is that they risk evicting code from the cache, since most modern processors have unified data and instruction caches. Thus, even if the LUT-based implementation is slightly faster, it runs a very strong risk of slowing everything else down. A microbenchmark won't show this. (But actually benchmarking your real code will, so that's always a good thing to do when feasible. If not, read on.)

    My rule of thumb is, if the LUT-based approach is not a clear performance win over the other approach in real-world benchmarks, I don't use it. It sounds like that is the case here. If the benchmark results are too close to call, it doesn't matter, so pick the implementation that doesn't bloat your code by 4k.

    0 讨论(0)
  • 2021-01-17 12:50

    Cody Gray already covered most of the bases above, so I'd just add a few of my own thoughts. Note that I'm not as negative on LUTs as Cody: rather than giving them a general "thumbs down", I think you need to carefully analyze the downsides. In particular, the smaller the LUT, the more likely it can be compared apples-for-apples to the computation approach.

    There are often cases where the values are very expensive to calculate on the fly, or where only small LUTs are needed. I do use the same rule of thumb for near-ties though: if the LUT approach is only a bit faster, I'll generally chose the computation approach, with a few exceptions (e.g., very large input size such that the LUT will be resident and used for many computations).

    SIMD

    Most of the discussion following this section is not SIMD specific - it applies both to scalar and SIMD codes. Before we get there, let's talk a bit about LUTs as it pertains specifically to SIMD.

    For SIMD code LUTs have some advantages and also additional drawbacks. The main drawback is that, outside of the PSHUFB-type trickery discussed below, there is no good SIMD equivalent to scalar LUT code. That is, while you can do N (where N is the SIMD width) parallel independent calculations per instruction using SIMD, you generally cannot do N lookups. Usually, you are restricted to the same number of lookups per cycle in SIMD code as your are in LUT code, with 2/cycle being a common number of modern hardware.

    This restriction is not just some oversight in SIMD ISAs - it is a fairly fundamental outcome of the way L1 caches are designed: they only have a very small number of read ports (as above, 2 is common) and each added port significantly increases the L1 size, power consumption, latency, etc. So you just aren't going to see general purpose CPUs offer 16-way loads from memory any time soon. You do often see a gather instruction available, but it doesn't get around this fundamental restriction: you'll still be limited by the 2-loads-per-cycle limit. The best you can hope for in gather is that it can notice when two loads are at the same address (or at least "near enough") such that they can be satisfied by the same load6.

    What SIMD does let you do is wider loads. So you can load say 32 consecutive bytes at once. This usually isn't useful for directly vectorizing a scalar lookup, but it can enable some other tricks (e.g., the vector can itself by a table, and you do a second lookup using "LUT in register" stuff as described below).

    On the other hand, LUTs often do find a new niche in SIMD code because:

    1. The fact that you've vectorized the code means you probably expect a moderate or large problem size, which helps amortize the cache costs of LUTs.

    2. More than scalar code, SIMD likes to load a lot of masks and other constants: yet it is often hard to calculate things like shuffle masks via "computation", so LUTs are often a natural fit here.

    3. SIMD instruction sets often have no way to directly load an immediate constant, unlike their scalar brethren, so you are often loading fixed constants from memory anyway. At that point, it makes sense to see if some part of the subsequent calculation can be folded into the load by doing a lookup rather than loading a fixed constant (you are already paying the latency penalty, etc).

    4. SIMD instruction sets often have shuffle/permutation instructions that can be re-purposed into "lookup within a register" functionality as described below.

    One thing to keep in mind when doing LUTs in SIMD code is keeping the table small. You want to avoid a 16 or 32-byte wide table entries if you can. In addition to the techniques for shrinking the table below, you can often put broadcast or "unpack" instructions to good use here if the entries have some regularity. In some cases (recent x86) such instructions may be "free" when replacing a plain load.

    Incremental Decision Problem

    It's true that micro-benchmarks almost always unfairly favor LUT-based approaches - but by how much depends a lot on the code in question. As usual, the best way to decide is simply to profile the real-world load, both ways. Of course, that's not always practical, and also suffers from an "incremental decision problem"1...

    If you make a series of optimization decisions based on benchmarks, taking the "best" approach each time based on a real-world benchmarks, later decisions may invalidate earlier ones. For example, let's say you are considering using a LUT or computation for function A. You might find that in a real-world benchmark, a LUT is somewhat faster, so you implement that. The next day, you test new implementations of function B, again with a LUT vs computation approach - you might again find that a LUT is better than computation, so you implement that - but if you went back and tested A, the results may be different! Now A might be better of with a computation approach since the addition of a LUT for function B caused increased cache contention. If you had optimized the functions in the reverse order, the problem would not occur2.

    So, in principle, functions A and B need to be optimized together, and that same principle can often apply to the entire program. Furthermore, your decisions for A and B also affect some hypothetical future function C, not even written yet, that may also like to do some lookups and may make even better use of limited cache space than A or B.

    All that to say that not only do you need to benchmark in a real-world scenario, you need to keep the impact on existing functions and even future functions in mind.

    Ballpark LUT Estimates

    When real-world testing is impractical or ineffective3, or you want another approach to validate your testing results, one might try to ballpark range of performance of the LUT approach from first principles.

    For example, take some ballmark number for an cache miss to DRAM like 200 cycles, and then you can estimate the worst-case performance of the LUT for various iteration sizes of your algorithm. For example, if the LUT approach takes 10 cycles when it hits in the cache, versus 20 cycles for the computation approach, and has a table of 640 bytes (10 cache lines), then you may pay a cost of 10 * 200 = 2,000 cycles to bring in the whole LUT, so you need to iterate at least about 200 times to pay that cost back. You may also want to double the cache miss cost, since bringing the LUT into cache presumably often also causes a downstream miss for whatever line was evicted.

    In this way you can sometimes say: "Yes, the LUT has a worst-case cost of X cycles due to cache effects, but we almost always pay that back because we typically call the method Y times at a savings of Z cycles/call".

    This is, of course, a rough and crude worst-case estimate. You may be able to make some better estimates if you know some more detailed characteristics of your application, such as whether the whole working set usually fits in some level of the cache. Finally, you can even consider tools like cachegrind to get some quantitative insight into how the LUT and computation code interact with the cache (but perhaps that time could also be better spent on generating real-world test cases).

    I-Cache Misses

    One thing not often mentioned in the LUT versus computation debate is the effect on I$. Some programs, especially big object oriented or branchy ones4 are more sensitive to instruction-cache pressure than data-cache pressure. If the computation-based approach takes significantly more static instructions (i.e., code side, not executed instruction count), it may somewhat favor the LUT. The same argument can be made, for example, when deciding to unroll or aggressively vectorize loops or not.

    Unfortunately, this effect is inherently "whole program" and non-linear, so it is hard to measure. That is, you may choose larger-but-faster code several times without an noticeable instruction-caching penalty, but then you cross some threshold and you get a few % drop - the proverbial straw that broke the camel's back. Thus, it is hard to measure and make good decisions in isolation.

    Hybrid Approaches

    Often what is compared is a pure-LUT vs computation approach. Often there is a middle ground where you can use a much smaller LUT combined with some computation.

    This computation might come before the lookup, where you map the input domain into an index with a smaller domain such that all inputs mapped to the same index have the same answer. A simple example would be calculating parity: you can do this "quickly" (in a micro-benchmark sense!) with a 65K lookup table, but you could also just xor-fold the input like input ^ (input >> 8) then use the bottom byte to index into a 256-entry table. So you cut down the size of the table by a factor of 256, at the cost of a couple more instructions (but still quite a bit faster than the full computation approach).

    Sometimes the computation occurs after the lookup. This often takes the form of storing the table in a slightly more "compressed" format and decompressing the output. Imagine, for example, some function which maps a byte to a boolean. Any such function can be implement by a lut bool[256], at the cost of 256 bytes. However, each entry really only needs one bit (32 bytes total), rather than one byte - if you are willing to "decompress" after the lookup, like return bitwise_lut[value] & (1 << (value & 7)).

    A totally different hybrid approach is to choose between the LUT and computation approaches at runtime, based on the problem size. For example, you might have a LUT-based approach to decode some base64 encoded data, which you know is fast but imposes a non-trivial cost on the cache and may suffer warmup misses, and you may have a computation-based approach which is slower in the long run but has no such issues. Since you know the size of the data up-front, why not simply choose the best algorithm based on some crossover point you calculate or derive through testing?

    That might seem like it gives you the best of both worlds, but it certainly isn't free: you pay a price in code complexity, testing complexity, the chance of latency bugs in one algorithm that aren't in the other, an occasional branch mis-predict on the initial check, and increased total code size.

    Reducing Cache Misses

    It is pretty clear by now that the main hard-to-measure factor in the performance of a LUT algorithm is the cache impact. Are there any tricks beyond the above we can use to reduce them?

    Locating the LUT next to the code

    In principle, it seems like for very small LUTs, you can simply put the LUT in the same cache line as the code. This works best if your LUT is somewhat smaller than a cache line; specifically, it works best if adding it to the size of the function doesn't change the total number of cache lines for the combined LUT + code, but may still have small advantages even if that is not the case5.

    I'm not sure why this isn't used more, perhaps there are some downsides I am not aware of.

    LUTs in GP and SIMD registers

    The extreme version of the "put the LUT near the code" approach is to locate the LUT in the code. In scalar code you can do this by loading a constant into a register, and then doing something like a variable shift-and-mask to pick out an element in register. For example, you could use a register as a 16-element boolean LUT to calculate parity.

    In general, an N-bit general-purpose register can be used to implement a LUT that doesn't exceed N bits. So a 64-bit register could implement an 8-element LUT for byte values, or a 64-element LUT for boolean values, etc.

    In the world of x86-64 SIMD, you can bring this idea to an extreme with the PSHUFB instruction (first available in SSSE3). In its 128-bit SSE incarnation, it effectively allows you to perform 16 parallel 4-bit to 8-bit lookups in a single cycle. The AVX2 version allows you to perform 32 such lookups in parallel. So you can do lookups on steroids, without most of the downsides of a real LUT (i.e., the table is held in a register - although you may need one load to get it there in the first place).

    This only works for small (16-element tables) - although you can extend that to 32, 64, etc., element tables with 2, 4, ..., PSHUFB operations and a similar number of blending operations, but this is still only viable for fairly small tables.


    1 Perhaps you could also call this is the "path-dependent optimization" or "non-additive optimization" problem.

    2 Of course, knowing that optimizing B then A would have worked in this case is more of academic interest than practical value, since there isn't a good way to know the correct ordering in advance.

    3 This is much more common that you might think - it's not just laziness that prevents effective real-world testing, it can include many other factors such as (a) no single "canonical" load because an application or library is used in very different contexts, (b) no "canonical" load because the application is unreleased and the actual usage patterns aren't known yet, (c) inability to test on future hardware, which may not even exist yet, (d) the whole application being so much larger than the function in question that differences are lost in the noise, (e) inability to replicate real-world cases due to data privacy issues (can't get customer data), etc., etc.

    4 Compilers, browsers, and all sorts of JIT'd code comes to mind.

    5 For example, by using a cache line that is brought in by sequential prefetch, that might otherwise be wasted, or by at least locating the code and LUT in the same 4K page, perhaps saving a TLB miss.

    6 It's worth noting that on Intel, despite being around for at least 4 new chip releases, gather still doesn't do this: it is limited to, at best, 2 loads per cycle even if there is duplication in the loaded indices.

    0 讨论(0)
提交回复
热议问题