Can the LSD issue uOPs from the next iteration of the detected loop?

后端 未结 1 832
日久生厌
日久生厌 2021-01-04 19:18

I was playing investigating the capabilities of the branch unit on port 0 of my Haswell starting with a very simple loop:

BITS 64
GLOBAL _start

SECTION .t         


        
相关标签:
1条回答
  • 2021-01-04 20:06

    All of the uops in your loop are branches (2 per iteration). I think that the reason that `lsd_cycles_4_uops is zero is because of a limitation in the renamer. According to the Intel Optimization Manual Section 2.4.3.1:

    The renamer can allocate two branches each cycle, compared to one branch each cycle in the previous microarchitecture. This can eliminate some bubbles in execution.

    That is a subsection of a section on the Sandy bridge microarchitecture. But to my knowledge, this applies to all later microarchitectures. The maximum renaming throughput is 4 uops per cycle. But at most two uops can be branches. So in this example where all uops are branches, the LSD can never deliver more than 2 uops at any given cycle even in the first iteration of the loop.

    Therefore, 2 branch uops will be allocated in the RS per cycle, and both (one predicated taken and one not taken) can be dispatched per cycle. So the RS occupancy does not grow.

    This limitation does not impact the performance of your program. Executing 2 branch uops per cycle, giving an IPC of 3 per cycle, is already optimal.

    I tried to find a performance event that can capture allocator stalls due to that limitation. The events RESOURCE_STALLS.ANY and UOPS_ISSUED.ANY (with cmask=1 and inv=1) don't seem to be relevant in this case. @IwillnotexistIdonotexist suggested to use IDQ_UOPS_NOT_DELIVERED.CORE. I present the results below for the performance event and all of its supported variants. I also provide the correct meaning of these events because the manual is wrong. T denotes the number of iterations.

    IDQ_UOPS_NOT_DELIVERED.CORE: Counts the number of slots that were not utilized by the allocator. If the program ran for C core cycles, then the total number of slot is 4*C. The measured value is almost equal to 2*T. Since the the number of cycles is T, the number of slots is 4*T, which means that about half the issue slots were not utilized.

    IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE: Counts the number of cycles where zero uops were delivered from the IDQ. The measured value is negligible.

    IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE: Counts the number of cycles where at most 1 uops were delivered from the IDQ. The measured value is negligible.

    IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE: Counts the number of cycles where at most 2 uops were delivered from the IDQ: The measured value is almost equal to T.

    IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE: Counts the number of cycles where at most 3 uops were delivered from the IDQ: The measured value is almost equal to T.

    Therefore, since the execution time is almost equal to T core cycles, we can conclude that the allocator only allocates exactly 2 uops per cycle in most cycles., which is equal to the dispatch rate.

    Note that the RS in Haswell and Skylake holds unfused uops. So each entry can hold a single unfused uop. See Footnote 2. But this doesn't matter here because there is no microfusion.

    0 讨论(0)
提交回复
热议问题