Efficient by-hand loop unrolling

Question

I have this C code:

for (k = 0; k < n_n; k++) {
    if (k == i || k == j) continue;
    dd=q2_vect[k]-q1_vect;
    d2=dd*dd;
    if (d2<0) {
        a=1;
        break;
    }       
}

For compiler optimization reasons (on the SPE of the Cell processor), I need to unroll this loop by hand, so I tried:

dd=q2_vect[0]-q1_vect;
d2=dd*dd;
if (d2<0)goto done;

dd=q2_vect[1]-q1_vect;
d2=dd*dd;
if (d2<0)goto done;

dd=q2_vect[2]-q1_vect;
d2=dd*dd;
if (d2<0)goto done;

.....
.....

// end
goto notdone;

done: 
ok=0;

notdone:
.....

but I do not know how to deal with the

if (k == i || k == j) continue;

and with the fact that the number of loop iterations depends on "n_n" at each run, so by hand I would have to write the body as many times as the maximal value "n_n" can take.

How do you think it can be fixed?

Answer 1:

Are you sure the code as written is correct? If dd is a signed integer type, d2 < 0 can only come about through overflow, which is undefined behavior; and the condition in the if is never satisfied if d2 is unsigned or if dd and d2 are floating-point types. It looks like you're doing a broken search for the first index k other than i or j where squaring the expression q2_vect[k]-q1_vect overflows.

As for efficiently skipping the i and j iterations, I would instead just look at where the unrolled "loop" stopped, and restart it at k+1 if k was equal to i or j. This is assuming the code in your loop has no side effects/running total, which is true as written, but I expect you might have meant for the code to do something else (like summing the squares).
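
A minimal sketch of this restart idea, assuming double elements and a placeholder test d2 > 0 (the intended condition is unclear, so substitute the real one). The names scan_from and first_hit_skipping are made up for illustration. The inner scan has no i/j checks at all, which is the part one would unroll:

```c
#include <assert.h>
#include <stddef.h>

/* Plain scan with no i/j checks, starting at `start`.
   Returns the first index whose test fires, or n_n if none does.
   The test `d2 > 0` is a placeholder assumption, not the asker's condition. */
static size_t scan_from(const double *q2_vect, size_t n_n, double q1_vect,
                        size_t start)
{
    size_t k;
    for (k = start; k < n_n; k++) {   /* the loop one would unroll */
        double dd = q2_vect[k] - q1_vect;
        double d2 = dd * dd;
        if (d2 > 0.0)
            break;
    }
    return k;
}

/* Restart the scan at k + 1 whenever it stopped on an excluded index. */
size_t first_hit_skipping(const double *q2_vect, size_t n_n, double q1_vect,
                          size_t i, size_t j)
{
    size_t k;
    assert(q2_vect != NULL || n_n == 0);
    k = scan_from(q2_vect, n_n, q1_vect, 0);
    while (k < n_n && (k == i || k == j))
        k = scan_from(q2_vect, n_n, q1_vect, k + 1);
    return k;
}
```

The result is the first qualifying index, or n_n if there is none, so the flag from the question would be a = (result < n_n).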

Finally, I am highly skeptical of your wish to unroll the loop manually when you don't even seem to have working code to begin with. Any good compiler can unroll the loop for you, but often the type of loop unrolling you're looking to do makes performance worse rather than better. I think you'd do better getting your code to work correctly first, then measuring (and looking at the compiler-generated asm), and only trying to improve on that after you've determined there's a problem.

Answer 2:

This code as written is fairly unsuitable for SPEs since it's so branch-heavy. Also, information on the types of the variables involved would help; the test as written seems fairly obscure (even with the comparison changed to d2 > 0), but the code looks like it might be C++ using some sort of vector class that overloads operator - to mean vector subtraction and operator * of two vectors to compute a dot product.

The first thing to do with such simple loops on SPEs is to get them branch-free (at least the inner loop; i.e. unroll a couple of times and only check for early exit every N iterations) and use SIMD instructions: SPEs only have SIMD instructions, so not using SIMD processing in your loops instantly wastes 75% of your available register space and computational power. Similarly, SPEs can only load aligned qwords (16 bytes) at a time; using smaller data types requires extra work to shuffle the contents of registers around so that the value you're trying to load ends up in the "preferred slot".

You get rid of the if (k == i || k == j) by rewriting the first part of the loop in the following branch-free form (this is pseudocode; it's immediately applicable to ints, but you'll need to use intrinsics to get bitwise ops on floats):

dd = q2_vect[k] - q1_vect;
d2 = dd * dd;
d2 &= ~(cmp_equal(k, i) | cmp_equal(k, j));

Here, cmp_equal stands for the respective SPE intrinsic (semantics: cmp_equal(a, b) yields ~0u if a == b and 0 otherwise). This forces d2 to zero when k == i or k == j.

To avoid the if (d2 > 0) branch in the inner loop, do the following:

a |= cmp_greater(d2, 0);

and only check if a is nonzero (to early-out) every few loop iterations. If all values computed for d2 are nonnegative (will be the case if your type is ints, floats or a real-valued vector class), you can simplify this further. Just do:

a |= d2;

In the end, a will be nonzero if and only if at least one of the individual terms was nonzero. But be careful with integer overflows (if you're using ints) and NaNs (if you're using floats). If you have to handle these cases, the above simplification will break the code.
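
As a rough, portable illustration of the pseudocode above (not SPE code), the sketch below uses plain C integer operations in place of the intrinsics. The helper mask_eq is a hypothetical stand-in for cmp_equal, the data is assumed to be int32_t, the test is assumed to be d2 > 0, and the unroll factor of 4 is arbitrary:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an SPE compare intrinsic:
   all bits set when a == b, zero otherwise. */
static uint64_t mask_eq(size_t a, size_t b)
{
    return (uint64_t)0 - (uint64_t)(a == b);
}

/* Squared difference, forced to zero for the excluded indices i and j. */
static uint64_t masked_d2(const int32_t *q2_vect, int32_t q1_vect,
                          size_t k, size_t i, size_t j)
{
    int64_t dd = (int64_t)q2_vect[k] - q1_vect;
    /* |dd| < 2^32, so the square fits in 64 unsigned bits without loss. */
    uint64_t d2 = (uint64_t)dd * (uint64_t)dd;
    return d2 & ~(mask_eq(k, i) | mask_eq(k, j));
}

/* Returns 1 if some k other than i and j has d2 > 0, else 0. */
int any_nonzero_d2(const int32_t *q2_vect, size_t n_n, int32_t q1_vect,
                   size_t i, size_t j)
{
    uint64_t a = 0;
    size_t k = 0;

    assert(q2_vect != NULL || n_n == 0);

    /* Groups of 4 with no branch inside; one early-out test per group. */
    for (; k + 4 <= n_n; k += 4) {
        a |= masked_d2(q2_vect, q1_vect, k,     i, j);
        a |= masked_d2(q2_vect, q1_vect, k + 1, i, j);
        a |= masked_d2(q2_vect, q1_vect, k + 2, i, j);
        a |= masked_d2(q2_vect, q1_vect, k + 3, i, j);
        if (a != 0)
            return 1;
    }
    /* Remainder: fewer than 4 elements left. */
    for (; k < n_n; k++)
        a |= masked_d2(q2_vect, q1_vect, k, i, j);

    return a != 0;
}
```

Because d2 is an unsigned square here, the simplified a |= d2 form applies; on the SPE the same structure would be written with vector registers and the real compare intrinsics.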

Answer 3:

Usually loop unrolling means making the loop contain a few iterations, such that it is run fewer times. For example,

for(i=0;i<count;i++) {
    printf("%d", i);
}

could be unrolled to

i=0;
if(count%2==1) {
    printf("%d", i);
    i=1;
}
while(i<count) {
    printf("%d", i);
    printf("%d", i+1);
    i+=2;
}

Answer 4:

For the first problem, you need to not "execute" the loop body when the condition is met. For this particular problem, you can just place the logical negation of that condition inside the if statement's condition.

Normally unrolling is by a factor; the unrolled code still lives in a loop (unless the loop bounds are known to be very small). Furthermore, you'll need to do the "remainder" of the work (corresponding to the remainder of the problem size divided by the unroll factor) outside the loop.

So, an example of loop unrolling:

for (i = 0; i < n; ++i) do_something(i);

can be unrolled by a factor of 2 to:

for (i = 0; i < n-1; i += 2) { do_something(i); do_something(i+1); }
for (; i < n; ++i) do_something(i);

where the second loop does the "remainder". It also leaves i with the same final value the original loop would have produced; if i is not needed afterwards, the whole line can simply be if (i < n) do_something(i); in this case.
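
Applied to the loop in the question, the two points above might look like the sketch below. It assumes double data and uses d2 > 0 as a placeholder test; body and scan_unrolled2 are illustrative names, not from the question:

```c
#include <assert.h>
#include <stddef.h>

/* One iteration of the question's loop body, with the `continue`
   replaced by the negated condition. Returns 1 when the test fires.
   The test `d2 > 0` is a placeholder assumption. */
static int body(const double *q2_vect, double q1_vect,
                size_t k, size_t i, size_t j)
{
    if (k != i && k != j) {
        double dd = q2_vect[k] - q1_vect;
        double d2 = dd * dd;
        if (d2 > 0.0)
            return 1;
    }
    return 0;
}

/* Unrolled by a factor of 2; the return value plays the role of `a`. */
int scan_unrolled2(const double *q2_vect, size_t n_n, double q1_vect,
                   size_t i, size_t j)
{
    size_t k = 0;

    assert(q2_vect != NULL || n_n == 0);

    /* Main loop: two iterations per trip. Writing `k + 1 < n_n` avoids the
       unsigned wrap-around that `k < n_n - 1` would hit when n_n == 0. */
    for (; k + 1 < n_n; k += 2) {
        if (body(q2_vect, q1_vect, k, i, j) ||
            body(q2_vect, q1_vect, k + 1, i, j))
            return 1;
    }
    /* Remainder: at most one element is left. */
    if (k < n_n && body(q2_vect, q1_vect, k, i, j))
        return 1;
    return 0;
}
```

Whether this is any faster than what the compiler generates from the plain loop would have to be measured.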

Answer 5:

Assuming n_n is a compile-time constant, the loop may be trivially unrolled like so:

do
{ 
  k=0;
  if (k == i || k == j) 
    ;
  else
  {
    dd=q2_vect[ k]-q1_vect;
    d2=dd*dd;
    if (d2<0)
    {
      a=1;
      break;
    }
  }

  k=1;
  if (k == i || k == j) 
    ;
  else
  {
    dd=q2_vect[ k]-q1_vect;
    d2=dd*dd;
    if (d2<0)
    {
      a=1;
      break;
    }
  }

  /* and so on, n_n times */

  k= n_n-1;
  if (k == i || k == j) 
    ;
  else
  {
    dd=q2_vect[ k]-q1_vect;
    d2=dd*dd;
    if (d2<0)
    {
      a=1;
      break;
    }
  }

} while (0);

Essentially, everything after the continue goes into the else portion of an if statement.

Edit: since n_n is not a compile-time constant, you can still unroll the loop: run an unrolled block of several iterations inside a loop, and handle the leftover iterations with a switch-case statement. In fact, the two can be combined as follows; this construct is known as Duff's device.

/* A bare if/else rather than do { ... } while (0): the break has to leave
   the enclosing for loop, and a do-while wrapper would swallow it. */
#define LOOP_BODY              \
  if (k == i || k == j)        \
    ;                          \
  else                         \
  {                            \
    dd=q2_vect[k]-q1_vect;     \
    d2=dd*dd;                  \
    if (d2<0)                  \
    {                          \
      a=1;                     \
      break;                   \
    }                          \
  }


k = 0;
switch(n_n % 8)
{
  /* k is advanced after each body, so the for has no increment clause */
  case 0: for (; k < n_n; ) { LOOP_BODY; k++;
  case 7:                     LOOP_BODY; k++;
  case 6:                     LOOP_BODY; k++;
  case 5:                     LOOP_BODY; k++;
  case 4:                     LOOP_BODY; k++;
  case 3:                     LOOP_BODY; k++;
  case 2:                     LOOP_BODY; k++;
  case 1:                     LOOP_BODY; k++;}
}

Answer 6:

Unrolling this loop will not help much here. Software unrolling of an inner loop helps with software pipelining of instructions, which achieves higher IPC at run time. Here, unrolling could corrupt the logic.

Source: https://stackoverflow.com/questions/4298194/efficient-by-hand-loop-unrolling

Tags

optimization

loops

unroll