Why are conditionally executed instructions not present in later ARM instruction sets?

Asked by 隐瞒了意图╮ on 2020-12-31 02:57

Naively, conditionally executed instructions seem like a great idea to me.

As I read more about ARM (and ARM-like) instruction sets (Thumb2, Unicore, AArch64) I find that they either drop conditional execution entirely or severely restrict it. Was conditional execution a mistake at the time, or have subsequent changes made it an expensive waste of instruction bits?

7 Answers
  • 2020-12-31 03:54

    "Why are conditionally executed instructions not present ..." "Was conditional execution a mistake at the time, or have subsequent changes made it an expensive waste of instruction bits?"

    Wikipedia's article on "Predication - Disadvantages" provides a bit of info:

    "Disadvantages
    Predication's primary drawback is in increased encoding space. In typical implementations, every instruction reserves a bitfield for the predicate specifying under what conditions that instruction should have an effect. When available memory is limited, as on embedded devices, this space cost can be prohibitive. However, some architectures such as Thumb-2 are able to avoid this issue (see below). Other detriments are the following:

    • Predication complicates the hardware by adding levels of logic to critical paths and potentially degrades clock speed.
    • A predicated block includes cycles for all operations, so shorter paths may take longer and be penalized.

    Predication is most effective when paths are balanced or when the longest path is the most frequently executed, but determining such a path is very difficult at compile time, even in the presence of profiling information.

    ...

    In the ARM architecture, the original 32-bit instruction set provides a feature called conditional execution that allows most instructions to be predicated by one of 13 predicates that are based on some combination of the four condition codes set by the previous instruction. ARM's Thumb instruction set (1994) dropped conditional execution to reduce the size of instructions so they could fit in 16 bits, but its successor, Thumb-2 (2003) overcame this problem by using a special instruction which has no effect other than to supply predicates for the following four instructions. The 64-bit instruction set introduced in ARMv8-A (2011) replaced conditional execution with conditional selection instructions.".
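
    To make the tradeoff concrete, here is a small sketch (my own example, not from any of the sources quoted here) of the kind of code that A32 conditional execution handles elegantly. The assembly in the comment is only illustrative of what a compiler might emit, not the output of any particular compiler:

        /*
         * Euclid's GCD (for positive inputs) is the classic showcase for A32
         * conditional execution: the loop body can be emitted with no branch
         * between the compare and the subtractions, roughly:
         *
         *   gcd: CMP   r0, r1
         *        SUBGT r0, r0, r1   ; only runs when a > b
         *        SUBLT r1, r1, r0   ; only runs when a < b
         *        BNE   gcd          ; loop until a == b
         *
         * A64 has no SUBGT/SUBLT forms, so the same body needs either
         * branches or unconditional subtracts combined with conditional
         * selects.
         */
        unsigned gcd(unsigned a, unsigned b) {
            while (a != b) {
                if (a > b)
                    a -= b;
                else
                    b -= a;
            }
            return a;
        }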

    In "Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools", by Joseph A. Fisher, Paolo Faraboschi, and Cliff Young, on page 172:

    "... full predication complicates the hardware, the ISA, and the compiler. Unlike speculation, which favors deeper pipelines and faster clocks, predication adds levels of logic to critical paths and potentially degrades clock speed. Predicate operands use precious encoding bits in all instructions, and bypassing operations with predicate operands considerably complicates the forwarding logic. Predication's benefits for acyclic or "control-oriented" code have been the subject of lively academic and commercial debate, and the jury is still out on whether the benefits of predication justify the massive hardware cost to support full predication.

    The argument between full predication and partial predication is even more subtle. Full predication is more expressive and allows the compiler to predicate blocks that contain any combination of operations. Partial predication requires aggressive speculation and embeds some intrinsic limitations (for example, it cannot predicate blocks containing call operations). In terms of implementation complexity, full predication has much higher demands on the instruction encodings and the microarchitecture, as described previously, whereas partial predication with select operations is a good match for most microarchitectures and datapaths and has no impact on complexity, area, or speed.

    Predication in the Embedded Domain
    In the embedded domain, it is difficult to justify the code size penalty of a large set of predicate registers. Full predication implies a 'pay up front' philosophy, in which the cost of the predicate machinery needs to be paid regardless of how often it is used. For example, adding 6 predicate bits to address 64 predicates helped push the IPF encoding to 42 bits per operation—an approach that would be prohibitively expensive for an embedded processor. ...".
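
    To put a number on that encoding cost (my arithmetic, using only the figures in the quote): addressing 64 predicate registers takes log2(64) = 6 bits in every operation, so at 42 bits per IPF operation roughly 6 / 42 ≈ 14% of each instruction's encoding is spent on the predicate field alone, before any opcode or register operands are encoded.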

    Cost, TDP, patents, and even the technical skill level needed to develop a competing product all come into play. In this case it was a cost/benefit call driven by updated coding techniques: what was thought to be wanted wasn't really used, or at least not used effectively enough to justify the cost of implementing it.

    As explained in another answer, the ARM documentation says little about the reason, even less than the RISC-V manual does. Here is what ARM had to say on page 8 of the "ARMv8 Instruction Set Overview":

    "3 A64 OVERVIEW
    The A64 instruction set provides similar functionality to the A32 and T32 instruction sets in AArch32 or ARMv7. However just as the addition of 32-bit instructions to the T32 instruction set rationalized some of the ARM ISA behaviors, the A64 instruction set includes further rationalizations. The highlights of the new instruction set are as follows:

    • ...

    • Reduced conditionality. Fewer instructions can set the condition flags. Only conditional branches, and a handful of data processing instructions read the condition flags. Conditional or predicated execution is not provided, and there is no equivalent of T32’s IT instruction (see §3.2).

    ...

    3.2 Conditional Instructions
    The A64 instruction set does not include the concept of predicated or conditional execution. Benchmarking shows that modern branch predictors work well enough that predicated execution of instructions does not offer sufficient benefit to justify its significant use of opcode space, and its implementation cost in advanced implementations.

    A very small set of “conditional data processing” instructions are provided. These instructions are unconditionally executed but use the condition flags as an extra input to the instruction. This set has been shown to be beneficial in situations where conditional branches predict poorly, or are otherwise inefficient.

    Further information is provided in section "4.3 Condition Codes", but it does not explain how the decision was arrived at.
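
    As a sketch of what those "conditional data processing" instructions cover (again my own example; the assembly in the comment is only an illustration of the sort of thing a compiler may emit):

        /*
         * A data-dependent select like this is exactly what A64's CSEL family
         * targets. With an unpredictable condition, a compiler will typically
         * produce something along the lines of:
         *
         *   CMP  w0, w1
         *   CSEL w0, w0, w1, GT   ; w0 = (a > b) ? a : b
         *
         * rather than a conditional branch that would mispredict about half
         * the time on random inputs.
         */
        int max_of(int a, int b) {
            return (a > b) ? a : b;
        }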

    The designers of the RISC-V ISA (an unrelated, more recently designed ISA) explain some of what goes into such a design decision on page 23 of http://riscv.org/spec/riscv-spec-v2.0.pdf:

    "The conditional branches were designed to include arithmetic comparison operations between two registers (as also done in PA-RISC, Xtensa, and MIPS R6), rather than use condition codes (x86, ARM, SPARC, PowerPC), or to only compare one register against zero (Alpha, MIPS), or two registers only for equality (MIPS). This design was motivated by the observation that a combined compare-and-branch instruction fits into a regular pipeline, avoids additional condition code state or use of a temporary register, and reduces static code size and dynamic instruction fetch traffic.

    ...

    Both conditional move and predicated instructions add complexity to out-of-order microarchitectures, adding an implicit third source operand due to the need to copy the original value of the destination architectural register into the renamed destination physical register if the predicate is false. Also, static compile-time decisions to use predication instead of branches can result in lower performance on inputs not included in the compiler training set, especially given that unpredictable branches are rare, and becoming rarer as branch prediction techniques improve.

    We note that various microarchitectural techniques exist to dynamically convert unpredictable short forward branches into internally predicated code to avoid the cost of flushing pipelines on a branch mispredict [6, 10, 9] and have been implemented in commercial processors [17].

    The simplest techniques just reduce the penalty of recovering from a mispredicted short forward branch by only flushing instructions in the branch shadow instead of the entire fetch pipeline, or by fetching instructions from both sides using wide instruction fetch or idle instruction fetch slots. More complex techniques for out-of-order cores add internal predicates on instructions in the branch shadow, with the internal predicate value written by the branch instruction, allowing the branch and following instructions to be executed speculatively and out-of-order with respect to other code [17].

    [6] Timothy H. Heil and James E. Smith. Selective dual path execution. Technical report, University of Wisconsin-Madison, November 1996.

    [9] Hyesoon Kim, Onur Mutlu, Jared Stark, and Yale N. Patt. Wish branches: Combining conditional branching and predication for adaptive predicated execution. In Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture, MICRO 38, pages 43–54, 2005.

    [10] A. Klauser, T. Austin, D. Grunwald, and B. Calder. Dynamic hammock predication for non-predicated instruction set architectures. In Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques, PACT ’98, Washington, DC, USA, 1998.

    [17] Balaram Sinharoy, R. Kalla, W. J. Starke, H. Q. Le, R. Cargnoni, J. A. Van Norstrand, B. J. Ronchetti, J. Stuecheli, J. Leenstra, G. L. Guthrie, D. Q. Nguyen, B. Blaner, C. F. Marino, E. Retter, and P. Williams. IBM POWER7 multicore server processor. IBM Journal of Research and Development, 55(3):1–1, 2011.
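
    The point quoted above about the implicit third source operand is easy to miss, so here is a toy model of it (my own illustration, not from the RISC-V spec): in an out-of-order core, a predicated write allocates a fresh physical register, and if the predicate turns out to be false that register must still receive the old architectural value, which therefore has to be read as an extra input.

        #include <stdbool.h>

        /* A predicated add as the rename/execute stages see it. */
        typedef struct {
            int  src1, src2;   /* the two ordinary source operands           */
            int  old_dest;     /* implicit third source: previous value of   */
                               /* the destination architectural register     */
            bool predicate;
        } predicated_add;

        /* Value written into the newly allocated physical register. */
        int execute(predicated_add op) {
            return op.predicate ? op.src1 + op.src2  /* predicate true: do the add    */
                                : op.old_dest;       /* false: preserve the old value */
        }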

    Removing predicated instructions from 64-bit ARM freed four bits in the encoding of every instruction; this allowed adding one bit to each register field, thus doubling the number of registers.
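
    To see why those four bits roughly pay for the extra registers (my arithmetic, not ARM's): a field addressing 16 registers needs 4 bits and one addressing 32 registers needs 5, so a typical three-operand instruction needs 3 extra bits for its register fields, which the 4-bit condition field dropped from every A32 instruction comfortably covers.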

    In my opinion it is an error to omit that elision ability in a server processor that is getting pinned to a fabric, but tradeoffs are made. Conditional execution is not a mistake (when well implemented), it is expensive, and it is not a waste (the bits are smart and mind their own business). Conditional-select instructions were simply the easier and better choice here.

    It is like any CPU extension, or adding a GPU: if you can make skillful use of your tools then you're good to go; otherwise pack light.

    Compare Intel's TSX. Wikipedia: "According to different benchmarks, TSX can provide around 40% faster applications execution in specific workloads, and 4–5 times more database transactions per second (TPS)."

    It is 'costly' (in some situations) but important for the current style of programming, or, more pessimistically, a means to score far higher in synthetic benchmarks.

    Someday it will be as easy as Lego, and you will be able to ask 'it' to assemble itself and do your bidding; until then the processor must support programmer (and compiler writer) laziness, hence the rarity of programs that can run mostly on the GPU (but we are getting there).

    Hence the removal of (great) features that are thought to be unwanted, or that were not implemented in a cost-effective and competitive manner.

    Thus TSX rules for now; but ARM CPUs need fancy threading for their fabric too.

    URL References:

    AMD: https://en.wikipedia.org/wiki/Advanced_Synchronization_Facility

    Intel: https://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions

