I'm currently testing out Parallel for C#. Generally it works fine, and using parallel is faster than the normal foreach loops. However, at times (like 1 out of 5 times), my
First of all, be careful with `Parallel` - it doesn't shield you from thread-safety issues. In your original code, you used non-thread-safe code when filling the list of results. In general, you want to avoid sharing any state (although read-only access to the list is fine in a case like this). If you really want to use `Parallel.For` or `Parallel.ForEach` for filtering and aggregation (really, `AsParallel` is what you want in those cases), you should use the overload with thread-local state - you'd do the final results aggregation in the `localFinally` delegate (note that it still runs on a different thread, so you need to ensure thread-safety; locking is fine in that case, though, since you're only doing it once per thread rather than on every iteration).
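To illustrate, here's a minimal sketch of that thread-local overload (the `data` array and the `== 1` filter are assumed from the original question; the class and method names are just for the example):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class ThreadLocalDemo
{
    // Filters the 1s out of the data using the thread-local overload of Parallel.For.
    public static List<int> Filter(int[] data)
    {
        var results = new List<int>();
        object resultsLock = new object();

        Parallel.For(0, data.Length,
            // localInit: every worker thread starts with its own private list
            () => new List<int>(),
            // body: only the thread-local list is touched - no locking per iteration
            (i, loopState, localList) =>
            {
                if (data[i] == 1)
                    localList.Add(data[i]);
                return localList;
            },
            // localFinally: runs once per worker thread, so locking here is cheap
            localList =>
            {
                lock (resultsLock)
                    results.AddRange(localList);
            });

        return results;
    }

    static void Main()
    {
        var data = new int[] { 1, 0, 2, 1, 0, 1 };
        Console.WriteLine(Filter(data).Count); // 3
    }
}
```

This keeps the per-iteration path lock-free and only pays for synchronization once per worker thread.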
Now, the obvious first thing to try in a problem like this is to use a profiler. So I've done that. The results are as follows:

- There's hardly any garbage collection going on.
- Almost all of the time is spent in the `Parallel.For` or `Parallel.ForEach` bodies themselves, not in your code (the simple `if (data[i] == 1) results.Add(data[i])`).

The first means we can say the GC probably isn't the culprit. Indeed, it doesn't get any chance to run. The second is more curious - it means that in some cases, the overhead of `Parallel` is way out of line - but it's seemingly random: sometimes it works without a hitch, and sometimes it takes half a second. This would usually point to the GC, but we've ruled that out already.
I've tried using the overload without the loop state, but that didn't help. I've tried limiting `MaxDegreeOfParallelism`, but it only ever hurt things. Now, obviously, this code is absolutely dominated by cache access - there's hardly any CPU work and no I/O - which will always favour a single-threaded solution; but even a `MaxDegreeOfParallelism` of 1 doesn't help - indeed, 2 seems to be the fastest on my system. More is useless - again, cache access dominates. It's still curious - I'm using a server CPU for the tests, which has plenty of cache for all of the data at once, and while we're not doing 100% sequential access (which pretty much gets rid of the latency entirely), it should be quite sequential enough. Regardless, we have the baseline of memory throughput in the single-threaded solution, and it's very close to the speed of the parallelised case when it works well (parallelised, I'm seeing 40% less runtime than single-threaded, on a four-core server CPU, for an embarrassingly parallel problem - again, obviously, memory access is the limit).
So, it's time to check the reference source for `Parallel.For`. In a case like this, it simply creates ranges based on the number of workers - one range for each. So it's not the ranges - there's no overhead from that.
The core simply runs a task that iterates over the given range. There are a few interesting bits - for example, the task will get "suspended" if it takes too long. However, that doesn't seem to fit the data too well - why would something like this cause random delays unrelated to the data size? No matter how small the work job, and no matter how low `MaxDegreeOfParallelism` is, we get "random" slowdowns. It might be a problem, but I have no idea how to check it.
The most interesting thing is that expanding the test data does nothing to the anomaly - while it makes the "good" parallel runs much faster (even getting close to perfect efficiency in my tests, oddly enough), the "bad" ones are still just as bad. In fact, in a few of my test runs, they were absurdly bad (up to ten times the runtime of the "normal" loop).
So, let's have a look at the threads. I've artificially bumped the number of threads in the `ThreadPool` to ensure that expanding the thread pool isn't a bottleneck (it shouldn't be if everything worked well, but...). And here comes the first surprise - while the "good" runs simply use the 4-8 threads that make sense, the "bad" runs expand over all the available threads in the pool, even if there's a hundred of them. Oops?
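The pool bump and the thread-count observation can be sketched like this (`ThreadPool.SetMinThreads` is the standard API for pre-growing the pool; the dictionary is just an illustrative way to count the threads that actually run the body):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

static class ThreadCountDemo
{
    public static int CountWorkerThreads(int iterations)
    {
        // Pre-grow the pool so thread injection isn't the bottleneck.
        ThreadPool.SetMinThreads(workerThreads: 100, completionPortThreads: 100);

        var seenThreads = new ConcurrentDictionary<int, bool>();

        Parallel.For(0, iterations, i =>
        {
            // Record which managed threads actually execute the loop body.
            seenThreads.TryAdd(Environment.CurrentManagedThreadId, true);
        });

        return seenThreads.Count;
    }

    static void Main()
    {
        // On a "good" run this is close to the core count;
        // on a "bad" run it can be far higher.
        Console.WriteLine(CountWorkerThreads(10_000_000));
    }
}
```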
Let's dive into the source code once again. `Parallel` internally uses `Task.RunSynchronously` to run the root partitioned work job, and `Wait`s on the result. When I look at the parallel stacks, there are 97 threads executing the loop body, and only one that actually has `RunSynchronously` on its stack (as expected - that's the main thread). The others are plain thread-pool threads. The task IDs also tell a story - there are thousands of individual tasks being created while doing the iteration. Obviously, something is very wrong here. Even if I remove the whole loop body, this still happens, so it's not some closure weirdness either.
Explicitly setting `MaxDegreeOfParallelism` offsets this somewhat - the number of threads used doesn't explode anymore - however, the number of tasks still does. But we've already seen that the ranges are just the number of parallel tasks running - so why keep creating more and more tasks? Using the debugger confirms this - with a MaxDOP of four, there are only five ranges (there's some alignment that causes the fifth range). Interestingly, one of the completed ranges (how did the first one finish so much ahead of the rest?) has an index higher than the range it iterates - this is because the "scheduler" assigns range-partitions in slices of up to 16.

The root task is self-replicating, so instead of explicitly starting e.g. four tasks to handle the data, it waits for the scheduler to replicate the task to handle more data. It's kind of hard to read - we're talking about complex multi-threaded lock-less code - but it seems that it always assigns work in slices much smaller than the partitioned ranges. In my testing, the maximum slice size was 16 - a far cry from the millions of items I'm iterating over. Sixteen iterations with a body like this take no time at all, which can lead to many issues with the algorithm (the biggest being the infrastructure taking more CPU work than the actual iterator body). In some cases, cache thrashing might hurt performance even further (perhaps when there's a lot of variation in the body runtimes), but most of the time, the access is sequential enough.
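A minimal way to see the overhead for yourself - a sketch comparing a plain loop against `Parallel.For` over the same trivial body; run it a few times to catch a "bad" run (the timings will vary by machine, so no specific numbers are claimed here):

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

static class OverheadDemo
{
    public static (long plainMs, long parallelMs) Measure(int n)
    {
        var data = new int[n];
        long count = 0;

        // Baseline: a plain single-threaded loop.
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < data.Length; i++)
            if (data[i] == 1) count++;
        sw.Stop();
        long plainMs = sw.ElapsedMilliseconds;

        // Same trivial body under Parallel.For - the per-slice
        // scheduling overhead dominates work this small.
        sw.Restart();
        Parallel.For(0, data.Length, i =>
        {
            if (data[i] == 1) Interlocked.Increment(ref count);
        });
        sw.Stop();

        return (plainMs, sw.ElapsedMilliseconds);
    }

    static void Main()
    {
        var (plain, parallel) = Measure(10_000_000);
        Console.WriteLine($"for:          {plain} ms");
        Console.WriteLine($"Parallel.For: {parallel} ms");
    }
}
```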
TL;DR
Don't use `Parallel.For` and `Parallel.ForEach` if your work-per-iteration is very short (on the order of milliseconds). `AsParallel` or just running the iteration single-threaded will most likely be much faster.
Slightly longer explanation:
It seems that `Parallel.For` and `Parallel.ForEach` are designed for scenarios where the individual items you're iterating over take a substantial amount of time to execute (i.e. lots of work per item, not tiny amounts of work over a lot of items). They seem to perform badly when the iterator body is too short. If you're not doing substantial work in the iterator body, use `AsParallel` instead of `Parallel.*`. The sweet spot seems to be somewhere under 150ms per slice (around 10ms per iteration). Otherwise, `Parallel.*` will spend tons of time in its own code and hardly any time doing your iteration (in my case, the usual number was somewhere around 5-10% of the time spent in the body - embarrassingly bad).
Sadly, I didn't find any warning about this on MSDN - there are even samples going over substantial amounts of data, but there's no hint at the terrible performance hit of doing so. Testing the very same sample code on my computer, I found that it was indeed often slower than a single-threaded iteration, and at the best of times barely faster (around 30-40% time savings while running on four CPU cores - not very efficient).
EDIT:

Willaien found a mention on MSDN about this very issue, and how to solve it: https://msdn.microsoft.com/en-us/library/dd560853(v=vs.110).aspx. The idea is to use a custom partitioner and iterate over that in the `Parallel.For` body (i.e. a loop inside `Parallel.For`'s loop). However, for most cases, using `AsParallel` is probably still a better choice - simple loop bodies usually mean some kind of a map/reduce operation, and `AsParallel` and LINQ in general are great at that. For example, your sample code could be rewritten simply as:
```csharp
var result = testData.AsParallel().Where(i => i == 1).ToList();
```
The only case where using `AsParallel` is a bad idea is the same as with LINQ in general - when your loop body has side-effects. Some might be tolerable, but it's safer to avoid them altogether.
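For completeness, the custom-partitioner workaround from the linked MSDN page looks roughly like this: `Partitioner.Create(0, length)` hands `Parallel.ForEach` index ranges instead of single indices, so the per-task overhead is paid once per range rather than once per element (a sketch; the `testData` array and filter are assumed from the question):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class PartitionerDemo
{
    public static List<int> Filter(int[] testData)
    {
        var results = new ConcurrentBag<int>();

        // Hand Parallel.ForEach large index ranges instead of single indices.
        Parallel.ForEach(Partitioner.Create(0, testData.Length), range =>
        {
            // A plain loop inside Parallel.ForEach's loop - this is the
            // "loop in the loop" the MSDN page describes.
            for (int i = range.Item1; i < range.Item2; i++)
            {
                if (testData[i] == 1)
                    results.Add(testData[i]);
            }
        });

        return results.ToList();
    }

    static void Main()
    {
        var testData = new int[1_000_000];
        testData[7] = 1;
        testData[999_999] = 1;
        Console.WriteLine(Filter(testData).Count); // 2
    }
}
```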