pipeline stalling and bypassing examples

Question

I am taking a course on Computer Architecture. I found this website from another university, with notes and videos that have been helping me so far: CS6810, Univ of Utah. I am working through this series of notes but need some explanation of a few of the example problems. I am currently looking at Problem 7, on pages 17-18. The answers are given in the notes on page 18, but I am unsure how the professor reaches them. He states on his class webpage that he does not provide worked solutions to anything, so that is out of the picture.

For those that cannot view the pdf, the problem is as follows:

Consider an 8-stage pipeline where Register Read (RR) and Register Write (RW) take a full cycle. Key: Instruction Fetch = IF, Decode = DE, ALU = AL, Data Memory = DM, Latch # = L#

L1-->IF-->L2-->DE-->L3-->RR-->L4-->AL-->L5-->AL-->L6-->DM-->L7-->DM-->L8-->RW-->L9

Given the following pairs of instructions, determine the number of stalls for the 2nd instruction, with and without bypassing:

ADD R1 + R2 -> R3, ADD R3 + R4 -> R5 : without bypassing 5, with bypassing 1

LD[R1] -> R2, ADD R2 + R3 -> R4 : without bypassing 5, with bypassing 3

LD[R1] -> R2, SD[R2] -> R3 : without bypassing 5, with bypassing 3

LD[R1] -> R2, SD[R3] -> R2 : without bypassing 5, with bypassing 1

I understand how each of them generates 5 stalls without bypassing, and I understand how the first one generates only 1 stall with bypassing, but I am uncertain how the stall counts with bypassing arise in cases 2-4.

Any help would be appreciated.

edit (for further clarification, my understanding of how the cases would look): ST = Stall, latches are implied

IF-->DE-->RR-->AL-->AL-->DM-->DM-->RW
     IF-->DE-->ST-->ST-->ST-->ST-->ST-->RR-->AL-->AL-->DM-->DM-->RW (without)
     IF-->DE-->RR-->ST-->AL-->AL-->DM-->DM-->RW                     (with)

Without bypassing, I2 stalls before entering RR and has to wait until R3 is written before it can enter RR; this understanding is universal amongst all the cases. With bypassing, I2 can enter RR but stalls until the arithmetic is done by I1, which is after the second ALU stage.

IF-->DE-->RR-->AL-->AL-->DM-->DM-->RW
     IF-->DE-->ST-->ST-->ST-->ST-->ST-->RR-->AL-->AL-->DM-->DM-->RW (without)
     IF-->DE-->RR-->ST-->ST-->ST-->AL-->AL-->DM-->DM-->RW           (with)

With bypassing, I2 can enter RR but must wait until R2 is processed, and this occurs after the second DM stage of I1.

IF-->DE-->RR-->AL-->AL-->DM-->DM-->RW
     IF-->DE-->ST-->ST-->ST-->ST-->ST-->RR-->AL-->AL-->DM-->DM-->RW (without)
     IF-->DE-->RR-->ST-->ST-->ST-->AL-->AL-->DM-->DM-->RW           (with)

With bypassing, I2 can enter RR but must wait until R2 is processed, and this occurs after the second DM stage of I1.

IF-->DE-->RR-->AL-->AL-->DM-->DM-->RW
     IF-->DE-->ST-->ST-->ST-->ST-->ST-->RR-->AL-->AL-->DM-->DM-->RW (without)
     IF-->DE-->RR-->AL-->AL-->ST-->DM-->DM-->RW                     (with)

With bypassing, I2 can continue along the pipeline until the second ALU stage and it must wait here until it can pull R2, which isn't processed by I1 until after its second DM stage.

And one more, just to make sure I understand everything:

I1: R1+R2-->R3, I2: SD[R4]<--R3

IF-->DE-->RR-->AL-->AL-->DM-->DM-->RW
     IF-->DE-->ST-->ST-->ST-->ST-->ST-->RR-->AL-->AL-->DM-->DM-->RW (without)
     IF-->DE-->RR-->AL-->AL-->DM-->DM-->RW                          (with)

It is my understanding that without bypassing, it would stall in the same place for the same number of cycles (5). With bypassing, however, there would be 0 stalls, because I2 would use its ALU stages to calculate the store address, and by the time it needs the value to store, it could take that value from the output of I1's second ALU stage.

Answer 1:

The stalls in cases 2 and 3 arise because the second instruction needs, in its first ALU stage, the result of the load in the previous instruction. That result is not available until after the load's second Data Memory stage, so the stall covers the earlier instruction's second ALU stage and its two Data Memory stages. (L8 of the first instruction lines up with L4 of the second.)

 L1-->IF-->L2-->DE-->L3-->RR-->L4-->AL-->L5-->AL-->L6-->DM-->L7-->DM-->L8-->RW-->L9
           L1-->IF-->L2-->DE-->L3-->RR-->STALL---->STALL---->STALL---->L4-->AL-->L5-->AL-->L6-->DM-->L7-->DM-->L8-->RW-->L9

For case 4, the value stored in memory by the second instruction is (presumably) not needed until the first Data Memory stage and the address generation part of the second instruction has no dependency on the first instruction. (L8 of the first instruction lines up with L6 of the second.)

 L1-->IF-->L2-->DE-->L3-->RR-->L4-->AL-->L5-->AL-->L6-->DM-->L7-->DM-->L8-->RW-->L9
           L1-->IF-->L2-->DE-->L3-->RR-->L4-->AL-->L5-->AL-->STALL---->L6-->DM-->L7-->DM-->L8-->RW-->L9

(Since writing to memory is a commitment of architectural state, similar to writing the register file, it might be more typical for a pipeline not to require the stored value until the RW stage.)

Without bypassing, all register source operands are retrieved from the register file in the Register Read stage. Since a new value is written to the register file in the Register Write stage, the dependent instruction cannot read it until the following cycle, so the given 8-stage pipeline requires 5 cycles of stall for such dependent cases.

 L1-->IF-->L2-->DE-->L3-->RR-->L4-->AL-->L5-->AL-->L6-->DM-->L7-->DM-->L8-->RW-->L9
           L1-->IF-->L2-->DE-->STALL---->STALL---->STALL---->STALL---->STALL---->L3-->RR-->L4-->AL-->L5-->AL-->L6-->DM-->L7-->DM-->L8-->RW-->L9

With bypassing, a dependent value can be forwarded from the earliest stage in which it is available, rather than from the Register Write stage: the end of the second ALU stage for arithmetic instructions, and the end of the second Data Memory stage for load instructions. It is delivered to the earliest stage of the dependent instruction in which the value is needed, rather than to the Register Read stage: before the ALU stages for arithmetic instructions and address computation, and before the Data Memory stages for store data (if stores require the stored value that early, as seems to be the case in this pipeline).
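The forwarding rule can be checked with a short calculation. The Python sketch below is only an illustration and is not taken from the course notes. It assumes the second instruction issues exactly one cycle after the first and that a value becomes usable in the cycle after its producing stage completes. The names `AL1`/`AL2` and `DM1`/`DM2` are labels introduced here to tell the paired stages apart.

```python
# Stage order of the 8-stage pipeline from the question (1-indexed positions).
# AL1/AL2 and DM1/DM2 are illustrative names for the paired stages.
STAGES = ["IF", "DE", "RR", "AL1", "AL2", "DM1", "DM2", "RW"]
POS = {name: i + 1 for i, name in enumerate(STAGES)}


def stalls(produced_after: str, needed_before: str) -> int:
    """Stall cycles for an instruction issued one cycle after its producer.

    produced_after: producer stage at whose end the value exists.
    needed_before:  consumer stage that must start with the value in hand.
    """
    # The producer completes stage p in cycle p. Unstalled, the consumer
    # would start stage c in cycle c + 1, but it cannot start before
    # cycle p + 1, so it has to slip by p - c cycles (never fewer than 0).
    return max(0, POS[produced_after] - POS[needed_before])


cases = {
    "any dependence, no bypassing (RW -> RR)": stalls("RW", "RR"),
    "ADD -> ADD (AL2 -> AL1)": stalls("AL2", "AL1"),
    "LD -> ADD (DM2 -> AL1)": stalls("DM2", "AL1"),
    "LD -> SD address (DM2 -> AL1)": stalls("DM2", "AL1"),
    "LD -> SD data (DM2 -> DM1)": stalls("DM2", "DM1"),
    "ADD -> SD data (AL2 -> DM1)": stalls("AL2", "DM1"),
}

for name, count in cases.items():
    print(f"{name}: {count}")
```

Under these assumptions the counts come out as 5, 1, 3, 3, 1 and 0, which agrees with the answers in the notes and with the zero-stall guess for the extra ADD-then-SD example in the question.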

(Aside: Some pipelines perform the register write in the first half of the cycle and the register read in the second half of the cycle. Not only can this reduce the number of access ports needed for the register file, but it also allows values to be available from the register file one cycle earlier since the read of a newly written value can occur in the later half of the same cycle as the write. This reduces the amount of bypassing needed.)
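To make the aside concrete, here is a minimal Python sketch of how a split-cycle register file would change the no-bypassing stall count. It is a hypothetical variation: the problem itself states that RR and RW each take a full cycle. The function name and flag are invented for this illustration, and the second instruction is assumed to issue one cycle after the first.

```python
RR_POS = 3  # Register Read is the 3rd stage of the 8-stage pipeline
RW_POS = 8  # Register Write is the 8th stage


def no_bypass_stalls(split_cycle_regfile: bool) -> int:
    """Stalls for a dependent instruction issued one cycle after its producer."""
    producer_rw_cycle = RW_POS  # producer enters IF in cycle 1
    # Full-cycle write: the value is readable only in the following cycle.
    # Split-cycle (write first half, read second half): same cycle works.
    earliest_rr_cycle = (
        producer_rw_cycle if split_cycle_regfile else producer_rw_cycle + 1
    )
    unstalled_rr_cycle = RR_POS + 1  # consumer enters IF in cycle 2
    return max(0, earliest_rr_cycle - unstalled_rr_cycle)


print(no_bypass_stalls(split_cycle_regfile=False))  # 5, as in the problem
print(no_bypass_stalls(split_cycle_regfile=True))   # 4, one cycle saved
```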

Source: https://stackoverflow.com/questions/19041315/pipeline-stalling-and-bypassing-examples

Tags

pipeline

computer-architecture