Vectorisation of for-loop with data dependency

深忆病人 2021-01-20 17:35

I have an implementation of BiCCG (Conjugate Gradient) based matrix solver which also accounts for periodicity. It happens to be the case that the implementation is compute

1 answer
  •  攒了一身酷
    2021-01-20 18:05

    This is a classic problem of loop-carried dependences. Every iteration of your loop depends on some other iterations having finished, so the only way it can be scheduled is serially.

    But that is only because of how your loop is written.

    You mention that R[i][j][k] depends on the calculation of R[i-1][j][k], R[i][j-1][k], R[i][j][k-1]. I see three dependences here -

    1. [1, 0, 0]
    2. [0, 1, 0]
    3. [0, 0, 1]

    I hope this representation is intuitive.

    For your present scenario, dependences 1) and 2) are not an issue: each has a 0 in k and a 1 in i or j, so it is carried by an outer loop and an iteration never waits on an earlier iteration of k because of them.

    The problem is 3). Since there is a 1 in k and a 0 in both i and j, every iteration of the k loop depends on its previous iteration. If we could somehow bring a number > 0 into the i or j position of that vector, we would be done. A loop skew transformation lets us do exactly that.
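
The reasoning above can be checked mechanically. What follows is only a minimal sketch in C, assuming the loop nest is ordered i, j, k with k innermost and that distance vectors are written in that same order. The function names `carried_level` and `innermost_is_parallel` are made up for this illustration and do not come from the question.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the loop level (0 = i, 1 = j, 2 = k) that carries a dependence,
 * i.e. the position of the first non-zero entry of its distance vector,
 * or -1 if the vector is all zeros. */
int carried_level(const int dist[3])
{
    assert(dist != NULL);
    for (int level = 0; level < 3; level++)
        if (dist[level] != 0)
            return level;
    return -1;
}

/* The innermost (k) loop can run in parallel only if no dependence
 * is carried at level 2. Returns 1 if parallel, 0 otherwise. */
int innermost_is_parallel(int deps[][3], size_t n)
{
    for (size_t d = 0; d < n; d++)
        if (carried_level(deps[d]) == 2)
            return 0;
    return 1;
}
```

Called with the three vectors listed above, `innermost_is_parallel` reports 0 because [0, 0, 1] is carried by k. Once every vector has a positive entry in an outer position, it reports 1.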

    A 3D example is slightly difficult to understand, so let's look at a 2D example with i and j.

    Suppose - R[i][j] depends on R[i-1][j] and R[i][j-1]. We have the same problem.

    If we have to represent this in a picture it looks like this -

    . <- . <- .
         |    |
         v    v
    . <- . <- .
         |    |
         v    v
    .    .    .
    

    In this picture, every point represents an iteration (i,j), and the arrows originating from each point lead to the iterations it depends on. It is easy to see why we cannot parallelize the innermost loop here.

    But suppose we did the skewing as -

            .
           /|   
          / |
        .   .
       /|  /|
      / | / |
    .   .   . 
       /|    
      / |  
    .   .     
    
    
    . 
    

    Now draw the same arrows as in the picture above (I cannot draw diagonal arrows in ASCII art). You will see that all the arrows point downwards, i.e. each one goes at least one iteration down, which means you can parallelize the horizontal loop.

    Now say your new loop dimensions are y (outer loop) and x (inner loop),

    your original variables i, j will be

    j = x and i = y - x (equivalently, y = i + j and x = j)

    Your loop body thus becomes -

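    /* y = i + j, x = j, i = y - x; the i = 0 row and j = 0 column are assumed to hold boundary values */
    for (y = 2; y < i_max + j_max - 1; y++)
        for (x = 1; x < j_max; x++) {
            if (y - x < 1 || y - x >= i_max) continue;   /* outside the original domain */
            R_dash[y][x] = R_dash[y-1][x-1] + R_dash[y-1][x];
        }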
    

    Here R_dash is the skewed domain, which has a one-to-one mapping to R: R_dash[y][x] holds R[y - x][x].

    You will see that both R_dash[y-1][x-1] and R_dash[y-1][x] will be computed in some previous iteration of y. And hence you can completely parallelize the x loop.
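
To make the skewed loop concrete, here is a hedged sketch in C that compares a plain serial sweep with a skewed one. It assumes the update is R[i][j] = R[i-1][j] + R[i][j-1], that row 0 and column 0 are given boundary values, and that the grid is stored row-major in a flat array. The function names, the temporary buffer and the bounds `lo`/`hi` are choices made for this illustration only.

```c
#include <assert.h>
#include <stdlib.h>

/* Reference version: interior points only, row 0 and column 0 are inputs. */
void sweep_serial(double *R, int rows, int cols)
{
    for (int i = 1; i < rows; i++)
        for (int j = 1; j < cols; j++)
            R[i * cols + j] = R[(i - 1) * cols + j] + R[i * cols + (j - 1)];
}

/* Skewed version with y = i + j and x = j, using a temporary skewed copy
 * so that the inner loop walks contiguous memory.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int sweep_skewed(double *R, int rows, int cols)
{
    assert(R != NULL && rows > 0 && cols > 0);
    int waves = rows + cols - 1;                 /* number of anti-diagonals */
    double *Rd = calloc((size_t)waves * (size_t)cols, sizeof *Rd);
    if (Rd == NULL)
        return -1;

    /* Copy into the skewed layout: R_dash[i + j][j] = R[i][j]. */
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            Rd[(i + j) * cols + j] = R[i * cols + j];

    for (int y = 2; y < waves; y++) {
        int lo = (y - rows + 1 > 1) ? y - rows + 1 : 1;  /* keeps i = y - x <= rows - 1 */
        int hi = (y - 1 < cols - 1) ? y - 1 : cols - 1;  /* keeps i = y - x >= 1 */
        /* Every read below comes from row y - 1, so the x iterations are independent. */
        for (int x = lo; x <= hi; x++)
            Rd[y * cols + x] = Rd[(y - 1) * cols + (x - 1)] + Rd[(y - 1) * cols + x];
    }

    /* Copy the result back to the original layout. */
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            R[i * cols + j] = Rd[(i + j) * cols + j];

    free(Rd);
    return 0;
}
```

Both functions should leave identical values in R. Whether a compiler actually vectorises the inner x loop depends on the compiler and its flags, and the two copies add overhead that a real solver would likely avoid by keeping the data in skewed form.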

    The transformation applied here is

    (i, j) -> (y, x) = (i + j, j).

    You can similarly work it out for 3 dimensions.
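
For three dimensions, one possible choice (not the only one) is to make t = i + j + k the outermost loop, so that all three dependences are carried by t and both inner loops become independent. The sketch below assumes the update is the sum of the three neighbours, which the question does not state, and assumes that the i = 0, j = 0 and k = 0 faces hold boundary values. The helper `idx` and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Flat index into an ni x nj x nk grid with k varying fastest. */
static int idx(int i, int j, int k, int nj, int nk)
{
    return (i * nj + j) * nk + k;
}

/* Reference version in the original i, j, k order. */
void sweep3d_serial(double *R, int ni, int nj, int nk)
{
    for (int i = 1; i < ni; i++)
        for (int j = 1; j < nj; j++)
            for (int k = 1; k < nk; k++)
                R[idx(i, j, k, nj, nk)] = R[idx(i - 1, j, k, nj, nk)]
                                        + R[idx(i, j - 1, k, nj, nk)]
                                        + R[idx(i, j, k - 1, nj, nk)];
}

/* Wavefront version: t = i + j + k is the outer loop. All three neighbours
 * of a point on hyperplane t lie on hyperplane t - 1, so the j and k loops
 * carry no dependence. */
void sweep3d_wavefront(double *R, int ni, int nj, int nk)
{
    assert(R != NULL && ni > 0 && nj > 0 && nk > 0);
    for (int t = 3; t <= ni + nj + nk - 3; t++)
        for (int j = 1; j < nj; j++)
            for (int k = 1; k < nk; k++) {
                int i = t - j - k;               /* recover the original index */
                if (i < 1 || i >= ni)
                    continue;                    /* point is not on this hyperplane */
                R[idx(i, j, k, nj, nk)] = R[idx(i - 1, j, k, nj, nk)]
                                        + R[idx(i, j - 1, k, nj, nk)]
                                        + R[idx(i, j, k - 1, nj, nk)];
            }
}
```

This version updates R in place, so consecutive k iterations touch memory with a large stride. For vectorisation it would probably pay to store the data in skewed form, as in the 2D case.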

    For further understanding on how affine transformations work and how they can be used to introduce parallelism, you can see these lecture notes.
