Parallel.For() with Interlocked.CompareExchange(): poorer performance and slightly different results to serial version

Submitted by 懵懂的女人 on 2019-12-18 09:39:10

Question


I experimented with calculating the mean of a list using Parallel.For(). I decided against it as it is about four times slower than a simple serial version. Yet I am intrigued by the fact that it does not yield exactly the same result as the serial one and I thought it would be instructive to learn why.

My code is:

public static double Mean(this IList<double> list)
{
    double sum = 0.0;

    Parallel.For(0, list.Count, i =>
    {
        double initialSum;
        double incrementedSum;
        SpinWait spinWait = new SpinWait();

        // Try incrementing the sum until the loop finds the initial sum unchanged,
        // so that it can safely replace it with the incremented one.
        while (true)
        {
            initialSum = sum;
            incrementedSum = initialSum + list[i];
            if (initialSum == Interlocked.CompareExchange(ref sum, incrementedSum, initialSum)) break;
            spinWait.SpinOnce();
        }
    });

    return sum / list.Count;
}

When I run the code on a random sequence of 2000000 points, I get results that differ from the serial mean in the last 2 digits.

I searched Stack Overflow and found this: VB.NET running sum in nested loop inside Parallel.for Synclock loses information. My case, however, is different from the one described there. There, a thread-local variable temp is the cause of the inaccuracy, whereas I use a single sum that is updated (I hope) according to the textbook Interlocked.CompareExchange() pattern. The question is of course moot because of the poor performance (which surprises me, although I am aware of the overhead), yet I am curious whether there is something to be learnt from this case.

Your thoughts are appreciated.


Answer 1:


Using double is the underlying problem. You can reassure yourself that the synchronization is not the cause by using long instead. The results you got are in fact correct, but that never makes a programmer happy.
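One way to run that check is sketched below. It repeats the compare-and-swap loop from the question on a long accumulator; the class and method names are made up for illustration. Integer addition is exact (barring overflow), so if the synchronization is sound the parallel total must equal the serial total, whatever order the threads run in.

```csharp
using System.Threading;
using System.Threading.Tasks;

// Illustrative names; not part of the original question.
public static class LongSumCheck
{
    // Plain serial reference sum.
    public static long SerialSum(long[] values)
    {
        long sum = 0;
        foreach (long value in values) sum += value;
        return sum;
    }

    // Same compare-and-swap retry loop as in the question, but on a long.
    public static long ParallelSum(long[] values)
    {
        long sum = 0;
        Parallel.For(0, values.Length, i =>
        {
            while (true)
            {
                long initialSum = Interlocked.Read(ref sum);
                long incrementedSum = initialSum + values[i];
                // The swap only succeeds if no other thread changed sum in the meantime.
                if (Interlocked.CompareExchange(ref sum, incrementedSum, initialSum) == initialSum) break;
            }
        });
        return sum;
    }
}
```

Calling LongSumCheck.ParallelSum and LongSumCheck.SerialSum on the same array should return identical values on every run, which points at the floating point arithmetic rather than the locking pattern.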

You discovered that floating point math is commutative but not associative. In other words, a + b == b + a, but (a + b) + c is not always equal to a + (b + c). Implicit in your code is that the order in which the numbers are added is effectively random, so the rounding errors accumulate differently from one run to the next.

This C++ question talks about it as well.
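A small sketch of the non-associativity, assuming IEEE 754 double arithmetic (which is what .NET uses for double). The class name and the chosen values are illustrative only. At a magnitude of 1e16 neighbouring doubles are 2 apart, so adding 1.0 to -1e16 is rounded away, while adding it after the two large values have cancelled is not.

```csharp
using System;

// Illustrative names and values; not from the original answer.
public static class AssociativityDemo
{
    // Parameters (rather than literals) keep the compiler from folding the sums at compile time.
    public static double LeftFirst(double a, double b, double c) => (a + b) + c;

    public static double RightFirst(double a, double b, double c) => a + (b + c);

    public static void Show()
    {
        double a = 1e16, b = -1e16, c = 1.0;
        Console.WriteLine(LeftFirst(a, b, c));  // (a + b) is exactly 0, so the result is 1
        Console.WriteLine(RightFirst(a, b, c)); // (b + c) rounds back to -1e16, so the result is 0
    }
}
```

With two million random values the effect is far smaller than in this contrived case, but the mechanism is the same: a different addition order gives different rounding in the low bits.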




Answer 2:


The accuracy issue is very well addressed in the other answers, so I won't repeat it here, other than to say: never trust the low bits of your floating point values. Instead, I'll try to explain the performance hit you're seeing and how to avoid it.

Since you haven't shown your sequential code, I'll assume the absolute simplest case:

double sum = list.Sum();

This is a very simple operation that should work about as fast as it is possible to go on one CPU core. With a very large list it seems like it should be possible to leverage multiple cores to sum the list. And, as it turns out, you can:

double sum = list.AsParallel().Sum();

Over several runs on my laptop (an i3 with 2 cores / 4 logical processors), summing the same list of 2 million random numbers each time, this gives a speedup of about 2.6 times.
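For anyone who wants to repeat the comparison, a minimal timing sketch follows. The helper names are hypothetical, Stopwatch timings of a single call are rough, and the ratio will depend on the machine, so treat it as a starting point rather than a proper benchmark.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

// Hypothetical helper names; not from the original answer.
public static class SumTiming
{
    // Builds a reproducible array of random doubles in [0, 1).
    public static double[] RandomData(int count, int seed)
    {
        var random = new Random(seed);
        var data = new double[count];
        for (int i = 0; i < count; i++) data[i] = random.NextDouble();
        return data;
    }

    // Runs the given summation once and returns its result and elapsed milliseconds.
    public static (double Result, double Milliseconds) Time(Func<double> sum)
    {
        Stopwatch watch = Stopwatch.StartNew();
        double result = sum();
        watch.Stop();
        return (result, watch.Elapsed.TotalMilliseconds);
    }

    public static void Compare(double[] data)
    {
        // Warm-up calls so JIT compilation and thread-pool start-up are not measured.
        Time(() => data.Sum());
        Time(() => data.AsParallel().Sum());

        var serial = Time(() => data.Sum());
        var parallel = Time(() => data.AsParallel().Sum());
        Console.WriteLine($"serial:   {serial.Result:R} in {serial.Milliseconds} ms");
        Console.WriteLine($"parallel: {parallel.Result:R} in {parallel.Milliseconds} ms");
    }
}
```

A call such as SumTiming.Compare(SumTiming.RandomData(2000000, 1)) prints both sums and timings; the two sums may differ in the last digits for the reason given in the other answer.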

Your code, however, is much, much slower than the simple case above. Instead of simply breaking the list into blocks that are summed independently and then summing the results, you are introducing all sorts of contention and waiting in order to have all of the threads update a single running sum.

Those extra waits, the much more complex code that supports them, creating objects and adding more work for the garbage collector all contribute to a much slower result. Not only are you wasting a whole lot of time on each item in the list but you are essentially forcing the program to do a sequential operation by making it wait for the other threads to leave the sum variable alone long enough for you to update it.
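The "sum blocks independently, then combine" idea can also be written with Parallel.For itself, using the overload that carries thread-local state. The sketch below is one possible shape, not the code that PLINQ runs internally; the class name and the lock object are assumptions for illustration.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Illustrative sketch; names are not from the original answer.
public static class PartitionedMean
{
    public static double Mean(IList<double> list)
    {
        double total = 0.0;
        object gate = new object();

        Parallel.For<double>(0, list.Count,
            // Each worker starts with its own private partial sum.
            () => 0.0,
            // The body touches only the private partial sum, so no synchronization is needed here.
            (i, loopState, partial) => partial + list[i],
            // Runs once per worker at the end: fold its partial sum into the shared total.
            partial => { lock (gate) { total += partial; } });

        return total / list.Count;
    }
}
```

Synchronization now happens once per worker rather than once per element. The low bits of the result can still vary between runs, because the partial sums are formed and combined in a different order each time.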

Assuming that the operation you are actually performing is more complex than a simple Sum() can handle, you may find that the Aggregate() method is more useful to you than Parallel.For.

There are several overloads of the Aggregate extension, including one that is effectively a map-reduce pattern implementation, with similarities to how big-data systems like MapReduce work. Documentation is here.

This version of Aggregate uses an accumulator seed (the starting value for each thread) and three functions:

  1. updateAccumulatorFunc is called for each item in the sequence and returns an updated accumulator value

  2. combineAccumulatorsFunc is used to combine the accumulators from each partition (thread) in your parallel enumerable

  3. resultSelector selects the final output value from the accumulated result.

A parallel sum using this method looks something like this:

double sum = list.AsParallel().Aggregate(
    // seed value for accumulators
    (double)0, 
    // add val to accumulator
    (acc, val) => acc + val,
    // add accumulators
    (acc1, acc2) => acc1 + acc2,
    // just return the final accumulator
    acc => acc
);

For simple aggregations that works fine. For a more complex aggregate with a non-trivial accumulator, there is a variant that accepts a function that creates the accumulator for each thread's initial state. This is useful, for example, in an Average implementation:

public class avg_acc
{
    public int count;
    public double sum;
}

public double ParallelAverage(IEnumerable<double> list)
{
    double avg = list.AsParallel().Aggregate(
        // accumulator factory method, called once per thread:
        () => new avg_acc { count = 0, sum = 0 },
        // update count and sum
        (acc, val) => { acc.count++; acc.sum += val; return acc; },
        // combine accumulators
        (ac1, ac2) => new avg_acc { count = ac1.count + ac2.count, sum = ac1.sum + ac2.sum },
        // calculate average
        acc => acc.sum / acc.count
    );
    return avg;
}

This is not as fast as the standard Average extension: it runs about 1.5 times faster than the sequential Average but about 1.6 times slower than the parallel one. Still, it shows how you can do quite complex operations in parallel without having to lock outputs or wait on other threads to stop messing with them, and how to use a complex accumulator to hold intermediate results.



Source: https://stackoverflow.com/questions/36308056/parallel-for-with-interlocked-compareexchange-poorer-performance-and-slight