How to deal with branch prediction when using a switch case in CPU emulation

野趣味 2021-02-02 11:59

I recently read the question "Why is it faster to process a sorted array than an unsorted array?" and found the answer absolutely fascinating, and it has completely chan

4 answers
  •  礼貌的吻别
    2021-02-02 12:47

    An indirect jump is probably the best approach to instruction decoding.

    On older machines, such as the Intel P6 from 1997, the indirect jump would probably be mispredicted.

    On modern machines, such as the Intel Core i7, there is an indirect jump predictor that does a fairly good job of avoiding the misprediction.

    But even on older machines that have no indirect branch predictor, you can play a trick. This trick is (was), by the way, documented in the Intel Code Optimization Guide from way back in the Intel P6 days:

    Instead of generating something that looks like

        loop:
           load reg := next_instruction_bits // or byte or word
           load reg2 := instruction_table[reg]
           jmp [reg2]                        // the only indirect jump
        label_instruction_00h_ADD: ...
           jmp loop
        label_instruction_01h_SUB: ...
           jmp loop
        ...
    

    generate the code as

        loop:
           load reg := next_instruction_bits // or byte or word
           load reg2 := instruction_table[reg]
           jmp [reg2]
        label_instruction_00h_ADD: ...
           load reg := next_instruction_bits // or byte or word
           load reg2 := instruction_table[reg]
           jmp [reg2]                        // ADD's own indirect jump
        label_instruction_01h_SUB: ...
           load reg := next_instruction_bits // or byte or word
           load reg2 := instruction_table[reg]
           jmp [reg2]                        // SUB's own indirect jump
        ...
    

    i.e. at the end of each instruction handler, replace the jump back to the top of the fetch/decode/execute loop with a copy of the code at the top of the loop.

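    It turns out that this has much better branch prediction, even in the absence of an indirect predictor. More precisely, a conditional, single-target, PC-indexed BTB will do considerably better on this latter, threaded code than on the original, which has only a single copy of the indirect jump. With one shared jump, the BTB can remember only one target for all opcodes; with one jump per handler, each BTB entry remembers the target last taken after that particular instruction.

    Most instruction sets have special patterns - e.g. on Intel x86, a compare instruction is nearly always followed by a branch - so the jump at the end of the compare handler is usually predicted correctly.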
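
    As a rough illustration of the threaded form, here is a minimal C sketch. It assumes a made-up four-opcode toy ISA (`OP_HALT`, `OP_INC`, `OP_DEC`, `OP_DBL`) and a hypothetical function name `run_threaded`; none of these come from the answer. The replicated dispatch relies on the GNU C "labels as values" extension (supported by GCC and Clang); on other compilers the sketch falls back to an ordinary central switch, which does not replicate the jump.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 4-opcode toy ISA, invented for this sketch. */
enum { OP_HALT = 0, OP_INC = 1, OP_DEC = 2, OP_DBL = 3, OP_COUNT = 4 };

/* Runs `code` until OP_HALT and returns the accumulator. */
uint32_t run_threaded(const uint8_t *code, size_t len)
{
    uint32_t acc = 0;
    size_t pc = 0;

#if defined(__GNUC__)
    /* GNU C extension: a table of handler label addresses. */
    static void *const table[OP_COUNT] = {
        &&op_halt, &&op_inc, &&op_dec, &&op_dbl
    };

    /* Fetch, look up, and jump. The macro is expanded at the end of
     * every handler, so each handler ends in its own indirect jump
     * instead of jumping back to one shared dispatch point. */
#define DISPATCH() \
    do { assert(pc < len); goto *table[code[pc++] & 3]; } while (0)

    DISPATCH();
op_inc:  acc += 1; DISPATCH();
op_dec:  acc -= 1; DISPATCH();
op_dbl:  acc *= 2; DISPATCH();
op_halt: return acc;
#undef DISPATCH
#else
    /* Portable fallback: one central switch, i.e. the non-threaded form. */
    for (;;) {
        assert(pc < len);
        switch (code[pc++] & 3) {
        case OP_INC: acc += 1; break;
        case OP_DEC: acc -= 1; break;
        case OP_DBL: acc *= 2; break;
        default:     return acc;
        }
    }
#endif
}
```

    Calling `run_threaded` on the byte sequence INC, INC, DBL, INC, DEC, DBL, HALT walks the accumulator through 1, 2, 4, 5, 4, 8 and returns 8. Whether the compiler keeps the jumps separate is up to its optimizer, so it is worth checking the generated assembly.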

    Good luck and have fun!

    (In case you care, the instruction decoders used by instruction set simulators in industry nearly always do a tree of N-way jumps or, as the data-driven dual, navigate a tree of N-way tables, with each entry in the tree pointing either to another node or to a function to evaluate.

    Oh, and perhaps I should mention: these tables, these switch statements or data structures, are generated by special purpose tools.

    A tree of N-way jumps, because there are problems when the number of cases in the jump table gets very large - in the tool, mkIrecog (make instruction recognizer) that I wrote in the 1980s, I usually did jump tables up to 64K entries in size, i.e. jumping on 16 bits. The compilers of the time broke when the jump tables exceeded 16M in size (24 bits).

    Data driven, i.e. a tree of nodes pointing to other nodes, because (a) on older machines indirect jumps may not be predicted well, and (b) it turns out that much of the time there is common code between instructions. Instead of taking a branch misprediction when jumping to the per-instruction case, then executing common code, then switching again and taking a second mispredict, you run the common code with slightly different parameters (such as how many bits of the instruction stream to consume, and where the next set of bits to branch on is).

    I was very aggressive in mkIrecog, allowing up to 32 bits to be used in a switch, although practical limitations nearly always stopped me at 16-24 bits. I remember that I often saw the first decode as a 16 or 18 bit switch (64K-256K entries), and all other decodes were much smaller, no bigger than 10 bits.

    Hmm: I posted mkIrecog to Usenet back circa 1990. ftp://ftp.lf.net/pub/unix/programming/misc/mkIrecog.tar.gz You may be able to see the tables used, if you care. (Be kind: I was young then. I can't remember if this was Pascal or C. I have since rewritten it many times - although I have not yet rewritten it to use C++ bit vectors.)

    Most of the other guys I know who do this sort of thing do things a byte at a time - i.e. an 8 bit, 256 way, branch or table lookup.)
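
    To make the data-driven variant concrete, here is a small C sketch of a tree of tables. It is not mkIrecog's actual format: the 8-bit encoding, the struct layout, and all names (`struct node`, `struct entry`, `decode`, `op_add`, and so on) are assumptions invented for illustration. Each node says which bits to index with, and each table entry points either to another node or to a handler function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*handler_fn)(uint8_t insn);   /* leaf action; returns a toy result */

struct node;
struct entry {
    const struct node *child;   /* non-NULL: keep decoding in that node */
    handler_fn handler;         /* used when child is NULL */
};
struct node {
    unsigned shift, width;      /* which instruction bits index the table */
    const struct entry *table;  /* holds 1u << width entries */
};

/* Hypothetical handlers for an invented 8-bit encoding. */
static int op_add(uint8_t insn) { return (insn & 0x3F) + 100; }
static int op_sub(uint8_t insn) { return (insn & 0x3F) + 200; }
static int op_shl(uint8_t insn) { (void)insn; return 300; }
static int op_shr(uint8_t insn) { (void)insn; return 301; }
static int op_bad(uint8_t insn) { (void)insn; return -1; }

/* Second level: bits 1..0 select within major opcode 2. */
static const struct entry shift_table[4] = {
    { NULL, op_shl }, { NULL, op_shr }, { NULL, op_bad }, { NULL, op_bad },
};
static const struct node shift_node = { 0, 2, shift_table };

/* First level: bits 7..6 select the major opcode. */
static const struct entry root_table[4] = {
    { NULL, op_add }, { NULL, op_sub }, { &shift_node, NULL }, { NULL, op_bad },
};
static const struct node root_node = { 6, 2, root_table };

/* Walk the tree: the same loop body serves every level, and only the
 * parameters (shift, width, table) change from node to node. */
handler_fn decode(uint8_t insn)
{
    const struct node *n = &root_node;
    for (;;) {
        unsigned index = (insn >> n->shift) & ((1u << n->width) - 1u);
        const struct entry *e = &n->table[index];
        if (e->child == NULL) {
            assert(e->handler != NULL);
            return e->handler;
        }
        n = e->child;
    }
}
```

    With this layout, `decode(0x81)` descends one level and returns `op_shr`, while `decode(0x05)` stops at the root and returns `op_add`. A real tool would generate such tables from an instruction set description rather than have them written by hand.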
