Associativity gives us parallelizability. But what does commutativity give?


Alexander Stepanov notes in one of his brilliant lectures at A9 (highly recommended, by the way) that the associative property gives us parallelizability. So what does the commutative property give us?

2 Answers
  • 2020-12-06 01:07

    Here is a more abstract answer with less emphasis on instruction level parallelism and more on thread level parallelism.

    A common objective in parallelism is to do a reduction of information. A simple example is the dot product of two arrays

    for(int i=0; i<N; i++) sum += x[i]*y[i];
    

    If the operation is associative then we can have each thread calculate a partial sum. The final sum is then the sum of the partial sums.

    If the operation is also commutative, the final sum can be done in any order. Otherwise the partial sums have to be summed in order.

    One problem is that multiple threads cannot write to the final sum at the same time without creating a race condition, so when one thread writes to the final sum the others have to wait. Being able to sum in any order can therefore be more efficient, because it's often difficult to have each thread finish in order.
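
    As a minimal sketch of that serialization (reusing x, y, and N from above; #pragma omp atomic is one way to make the shared update safe):

    float sum = 0;
    #pragma omp parallel
    {
        float partial = 0;                 // each thread's private partial sum
        #pragma omp for nowait
        for(int i=0; i<N; i++) partial += x[i]*y[i];
        #pragma omp atomic                 // one writer at a time, in whatever
        sum += partial;                    // order the threads happen to finish
    }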


    Let's choose an example: say there are two threads, and therefore two partial sums.

    If the operation is commutative, we could have this case:

    thread2 finishes its partial sum
    sum += thread2's partial sum
    thread2 finishes writing to sum   
    thread1 finishes its partial sum
    sum += thread1's partial sum
    

    However, if the operation does not commute, we would have to do

    thread2 finishes its partial sum
    thread2 waits for thread1 to write to sum
    thread1 finishes its partial sum
    sum += thread1's partial sum
    thread2 waits for thread1 to finish writing to sum    
    thread1 finishes writing to sum   
    sum += thread2's partial sum
    

    Here is an example of the dot product with OpenMP

    #pragma omp parallel for reduction(+: sum)
    for(int i=0; i<N; i++) sum += x[i]*y[i];
    

    The reduction clause assumes the operation (+ in this case) is commutative. Most people take this for granted.
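
    For illustration, the user-defined reductions added in OpenMP 4.0 (mentioned again at the end of this answer) make the combiner explicit, but the runtime may still apply it in any order, so commutativity is effectively still assumed. A hypothetical sketch for matrices (mat, mat_mul, mat_identity, and the array A are made-up names):

    #pragma omp declare reduction(matmul : mat : omp_out = mat_mul(omp_out, omp_in)) \
        initializer(omp_priv = mat_identity())

    mat prod = mat_identity();
    #pragma omp parallel for reduction(matmul : prod)
    for(int i=0; i<N; i++) prod = mat_mul(prod, A[i]);  // may be wrong if mat_mul doesn't commute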

    If the operation is not commutative we would have to do something like this

    float sum = 0;
    #pragma omp parallel
    {
        float sum_partial = 0;
        #pragma omp for schedule(static) nowait
        for(int i=0; i<N; i++) sum_partial += x[i]*y[i];
        #pragma omp for schedule(static) ordered
        for(int i=0; i<omp_get_num_threads(); i++) {
            #pragma omp ordered
            sum += sum_partial;
        }
    }
    

    The nowait clause tells OpenMP not to wait for each partial sum to finish. The ordered clause tells OpenMP to only write to sum in order of increasing thread number.

    This method does the final sum linearly. However, it could be done in log2(omp_get_num_threads()) steps.

    For example, if we had four threads, we could do the reduction in three sequential steps:

    1. calculate four partial sums in parallel: s1, s2, s3, s4
    2. calculate in parallel: s5 = s1 + s2 with thread1 and s6 = s3 + s4 with thread2
    3. calculate sum = s5 + s6 with thread1

    That's one advantage of using the reduction clause: since it's a black box, it may do the reduction in log2(omp_get_num_threads()) steps. OpenMP 4.0 allows defining custom reductions, but it nevertheless still assumes the operations are commutative, so it's not good for e.g. chain matrix multiplication. I'm not aware of an easy way with OpenMP to do the reduction in log2(omp_get_num_threads()) steps when the operations don't commute, but it can be sketched by hand, as below.
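
    A hand-rolled sketch (not a built-in OpenMP feature) of an ordered tree reduction: each thread first stores its partial result into a slot indexed by its thread number, then adjacent slots are combined pairwise. Because only neighboring blocks are merged, in thread order, associativity alone is enough; combine and the partial array are hypothetical names.

    // partial[0..n-1] holds each thread's result, in thread order
    int n = omp_get_max_threads();
    for(int stride = 1; stride < n; stride *= 2) {
        #pragma omp parallel for
        for(int i = 0; i + stride < n; i += 2*stride)
            partial[i] = combine(partial[i], partial[i + stride]);
    }
    // partial[0] now holds the full result after about log2(n) passes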

  • 2020-12-06 01:25

    Some architectures, x86 being a prime example, have instructions where one of the sources is also the destination. If you still need the original value of the destination after the operation, you need an extra instruction to copy it to another register.

    Commutative operations give you (or the compiler) a choice of which operand gets replaced with the result. So for example, compiling (with gcc 5.3 -O3 for x86-64 Linux calling convention):

    // FP: a,b,c in xmm0,1,2.  return value goes in xmm0
    // Intel syntax ASM is  op  dest, src
    // sd means Scalar Double (as opposed to packed vector, or to single-precision)
    double comm(double a, double b, double c) { return (c+a) * (c+b); }
        addsd   xmm0, xmm2
        addsd   xmm1, xmm2
        mulsd   xmm0, xmm1
        ret
    double hard(double a, double b, double c) { return (c-a) * (c-b); }
        movapd  xmm3, xmm2    ; reg-reg copy: move Aligned Packed Double
        subsd   xmm2, xmm1
        subsd   xmm3, xmm0
        movapd  xmm0, xmm3
        mulsd   xmm0, xmm2
        ret
    double easy(double a, double b, double c) { return (a-c) * (b-c); }
        subsd   xmm0, xmm2
        subsd   xmm1, xmm2
        mulsd   xmm0, xmm1
        ret
    

    x86 also allows using memory operands as a source, so you can fold loads into ALU operations, like addsd xmm0, [my_constant]. (Using an ALU op with a memory destination sucks: it has to do a read-modify-write.) Commutative operations give more scope for doing this.
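
    A hypothetical contrast (the asm in the comments is roughly what gcc -O3 emits under the x86-64 SysV convention): the memory operand can only be the source, so a - *p folds the load, while *p - a needs a separate load first. With a commutative op like + or *, either written order can fold.

    double fold(double a, const double *p) {    // subsd  xmm0, [rdi]   ; load folded
        return a - *p;
    }
    double no_fold(double a, const double *p) { // movsd  xmm1, [rdi]   ; separate load
        return *p - a;                          // subsd  xmm1, xmm0
    }                                           // movapd xmm0, xmm1    ; result to xmm0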

    x86's AVX extension (in Sandybridge, Jan 2011) added non-destructive versions of every existing instruction that used vector registers (same opcodes but with a multi-byte VEX prefix replacing all the previous prefixes and escape bytes). Other instruction-set extensions (like BMI/BMI2) also use the VEX coding scheme to introduce 3-operand non-destructive integer instructions, like PEXT r32a, r32b, r/m32: Parallel extract of bits from r32b using mask in r/m32. Result is written to r32a.
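
    In C, PEXT is exposed as the _pext_u32 intrinsic from <immintrin.h> (compile with -mbmi2); a small illustration:

    #include <immintrin.h>

    // pext gathers the bits of src selected by the mask into the low bits
    // of a third register, so both inputs survive the operation.
    unsigned low_nibbles(unsigned src) {
        return _pext_u32(src, 0x0F0F0F0Fu);  // pack the low nibble of each byte
    }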

    AVX also widened the vectors to 256b and added some new instructions. It's unfortunately nowhere near ubiquitous, and even Skylake Pentium/Celeron CPUs don't support it. It will be a long time before it's safe to ship binaries that assume AVX support. :(
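
    Until then, one option is runtime dispatch; a sketch using GCC/clang's __builtin_cpu_supports (dot_avx and dot_sse2 are hypothetical versions compiled with different target options):

    void dot_avx (const double *x, const double *y, double *out, int n);  // built with -mavx
    void dot_sse2(const double *x, const double *y, double *out, int n);  // baseline x86-64

    void dot(const double *x, const double *y, double *out, int n) {
        if (__builtin_cpu_supports("avx"))  // CPUID-based check at runtime
            dot_avx(x, y, out, n);
        else
            dot_sse2(x, y, out, n);
    }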

    Add -march=native to the compile options in the godbolt link above to see that AVX lets the compiler use just 3 instructions even for hard(). (godbolt runs on a Haswell server, so that includes AVX2 and BMI2):

    double hard(double a, double b, double c) { return (c-a) * (c-b); }
            vsubsd  xmm0, xmm2, xmm0
            vsubsd  xmm1, xmm2, xmm1
            vmulsd  xmm0, xmm0, xmm1
            ret
    