How to deal with branch prediction when using a switch case in CPU emulation

前端 未结 4 1297
野趣味
野趣味 2021-02-02 11:59

I recently read the question here Why is it faster to process a sorted array than an unsorted array? and found the answer to be absolutely fascinating and it has completely chan

相关标签:
4条回答
  • 2021-02-02 12:41

    As the branches on your 256-way switch statement are densely packed the compiler will implement this as a jump table, so you're correct in that you'll trigger a single branch mispredict every time you pass through this code (as the indirect jump won't display any kind of predictable behaviour). The penalty associated with this will be around 15 clock cycles on a modern CPU (Sandy Bridge), or maybe up to 25 on older microarchitectures that lack a micro-op cache. A good reference for this sort of thing is "Software optimisation resources" on agner.org. Page 43 in "Optimizing software in C++" is a good place to start.

    http://www.agner.org/optimize/?e=0,34

    The only way you could avoid this penalty is by ensuring that the same instructions are execution regardless of the value of the opcode. This can often be done by using conditional moves (which add a data dependency so are slower than a predictable branch) or otherwise looking for symmetry in your code paths. Considering what you're trying to do this is probably not going to be possible, and if it was then it would almost certainly add a overhead greater than the 15-25 clock cycles for the mispredict.

    In summary, on a modern architecture there's not much you can do that'll be more efficient than a switch/case, and the cost of mispredicting a branch isn't as much as you might expect.

    0 讨论(0)
  • 2021-02-02 12:46

    I thought I'd add something since no one mentioned it.

    Granted, the indirect jump is likely to be the best option.

    However, should you go with the N-compare way, there are two things that come to my mind:

    First, instead of doing N equality compares, you could do log(N) inequality compares, testing your instructions based on their numerical opcode by dichotomy (or test the number bit by bit if the value space is near to full) .This is a bit like a hashtable, you implement a static tree to find the final element.

    Second, you could run an analysis on the binary code you want to execute. You could even do that per binary, before execution, and runtime-patch your emulator. This analysis would build a histogram representing the frequency of instructions, and then you would organize your tests so that the most frequent instructions get predicted correctly.

    But I cant see this being faster than a medium 15 cycles penalty, unless you have 99% of MOV and you put an equality for the MOV opcode before the other tests.

    0 讨论(0)
  • 2021-02-02 12:47

    The indirect jump is probably the best thing to do for instruction decoding.

    On older machines, like say the Intel P6 from 1997, the indirect jump would probably get a branch misprediction.

    On modern machines, like say Intel Core i7, there is an indirect jump predictor that does a fairly good job of avoiding the branch misprediction.

    But even on the older machines that do not have an indirect branch predictor, you can play a trick. This trick is (was), by the way, documented in the Intel Code Optimization Guide from way back in the Intel P6 days:

    Instead of generating something that looks like

        loop:
           load reg := next_instruction_bits // or byte or word
           load reg2 := instruction_table[reg]
           jmp [reg]
        label_instruction_00h_ADD: ...
           jmp loop
        label_instruction_01h_SUB: ...
           jmp loop
        ...
    

    generate the code as

        loop:
           load reg := next_instruction_bits // or byte or word
           load reg2 := instruction_table[reg]
           jmp [reg]
        label_instruction_00h_ADD: ...
           load reg := next_instruction_bits // or byte or word
           load reg2 := instruction_table[reg]
           jmp [reg]
        label_instruction_01h_SUB: ...
           load reg := next_instruction_bits // or byte or word
           load reg2 := instruction_table[reg]
           jmp [reg]
        ...
    

    i.e. replace the jump to the top of the instruction fetch/decode/execute loop by the code at the top of the loop at each place.

    It turns out that this has much better branch prediction, even in the absence of an indirect predictor. More precisely, a conditional, single target, PC indexed BTB will be quite a lot better in this latter, threaded, code, than on the original with only a single copy of the indirect jump.

    Most instruction sets have special patterns - e.g. on Intel x86, a compare instruction is nearly always followed by a branch.

    Good luck and have fun!

    (In case you care, the instruction decoders used by instruction set simulators in industry nearly always do a tree of N-way jumps, or the data-driven dual, navigate a tree of N-way tables, with each entry in the tree pointing to other nodes, or to a function to evaluate.

    Oh, and perhaps I should mention: these tables, these switch statements or data structures, are generated by special purpose tools.

    A tree of N-way jumps, because there are problems when the number of cases in the jump table gets very large - in the tool, mkIrecog (make instruction recognizer) that I wrote in the 1980s, I usually did jump tables up to 64K entries in size, i.e. jumping on 16 bits. The compilers of the time broke when the jump tables exceeded 16M in size (24 bits).

    Data driven, i.e. a tree of nodes pointing to other nodes because (a) on older machines indirect jumps may not be predicted well, and (b) it turns out that much of the time there is common code between instructions - instead of having a branch misprediction when jumping to the case per instruction, then executing common code, then switching again, and getting a second mispredict, you do the common code, with slightly different parameters (like, how many bits of the instruction stream do you consume, and where the next set of bits to branch on is (are).

    I was very aggressive in mkIrecog, as I say allowing up to 32 bits to be used in a switch, although practical limitations nearly always stopped me at 16-24 bits. I remember that I often saw the first decode as a 16 or 18 bit switch (64K-256K entries), and all other decodes were much smaller, no bigger than 10 bits.

    Hmm: I posted mkIrecog to Usenet back circa 1990. ftp://ftp.lf.net/pub/unix/programming/misc/mkIrecog.tar.gz You may be able to see the tables used, if you care. (Be kind: I was young then. I can't remember if this was Pascal or C. I have since rewritten it many times - although I have not yet rewritten it to use C++ bit vectors.)

    Most of the other guys I know who do this sort of thing do things a byte at a time - i.e. an 8 bit, 256 way, branch or table lookup.)

    0 讨论(0)
  • 2021-02-02 13:00

    On the contrary, switch statements are likely to be converted to jump tables, which means they perform possibly a few ifs (for range checking), and a single jump. The ifs shouldn't cause a problem with branch prediction because it is unlikely you will have a bad op-code. The jump is not so friendly with the pipeline, but in the end, it's only one for the whole switch statement..

    I don't believe you can convert a long switch statement of op-codes into any other form that would result in better performance. This is of course, if your compiler is smart enough to convert it to a jump table. If not, you can do so manually.

    If in doubt, implement other methods and measure performance.

    Edit

    First of all, make sure you don't confuse branch prediction and branch target prediction.

    Branch prediction solely works on branch statements. It decides whether a branch condition would fail or succeed. They have nothing to do with the jump statement.

    Branch target prediction on the other hand tries to guess where the jump will end up in.

    So, your statement "there's no way the branch predictor can predict the jump" should be "there's no way the branch target predictor can predict the jump".

    In your particular case, I don't think you can actually avoid this. If you had a very small set of operations, perhaps you could come up with a formula that covers all your operations, like those made in logic circuits. However, with an instruction set as big as a CPU's, even if it were RISK, the cost of that computation is much higher than the penalty of a single jump.

    0 讨论(0)
提交回复
热议问题