In my application I execute anywhere from a couple of dozen to a couple of hundred actions in parallel (the actions have no return value).
Which approach would be the most efficient?
I used the tests from StriplingWarrior to find out where the difference comes from. I did this because, when I look at the code with Reflector, the Parallel class does nothing more than create a bunch of tasks and let them run.
From a theoretical point of view, both approaches should be equivalent in terms of run time. But the (not very realistic) tests with an empty action showed that the Parallel class is much faster.
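For reference, the two approaches being compared look roughly like this. This is a minimal sketch, assuming the actions are already collected in an `Action[]`; the class and method names (`Approaches`, `RunWithTasks`, `RunWithParallel`) are made up for illustration and are not taken from the original tests.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper class, for illustration only.
static class Approaches
{
    // Task version: allocates one Task object per action, then waits for all of them.
    public static void RunWithTasks(Action[] actions)
    {
        Task[] tasks = actions.Select(a => Task.Factory.StartNew(a)).ToArray();
        Task.WaitAll(tasks);
    }

    // Parallel version: hands the whole array to Parallel.Invoke,
    // which decides internally how to spread the actions over the cores.
    public static void RunWithParallel(Action[] actions)
    {
        Parallel.Invoke(actions);
    }
}
```

Both methods block until every action has finished, so they are interchangeable from the caller's point of view.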
The task version spends nearly all of its time creating new tasks, which leads to many garbage collections. The speed difference you see is purely due to the fact that you create many tasks which quickly become garbage.
The Parallel class instead creates its own Task-derived class, which runs concurrently on all CPUs. There is only one physical task running on all cores. The synchronization now happens inside the task delegate, which explains why the Parallel class is so much faster.
    ParallelForReplicatingTask task2 = new ParallelForReplicatingTask(parallelOptions, delegate {
        // Each replica atomically claims the next action index and runs that action.
        for (int k = Interlocked.Increment(ref actionIndex); k <= actionsCopy.Length; k = Interlocked.Increment(ref actionIndex))
        {
            actionsCopy[k - 1]();
        }
    }, TaskCreationOptions.None, InternalTaskOptions.SelfReplicating);
    task2.RunSynchronously(parallelOptions.EffectiveTaskScheduler);
    task2.Wait();
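`ParallelForReplicatingTask` is internal to the framework, so you cannot use it directly. The claiming loop itself can be imitated with public APIs, though. The following is only a sketch under my own assumptions: `ClaimLoop.Run` is a hypothetical name, it starts a fixed number of worker tasks (one per core at most) instead of a self-replicating task, and it leaves out the exception handling and cancellation that the Parallel class provides.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper class, for illustration only.
static class ClaimLoop
{
    public static void Run(Action[] actions)
    {
        int nextIndex = 0; // shared counter, only touched through Interlocked
        int workerCount = Math.Min(Environment.ProcessorCount, actions.Length);
        var workers = new Task[workerCount];

        for (int w = 0; w < workerCount; w++)
        {
            workers[w] = Task.Factory.StartNew(() =>
            {
                // Interlocked.Increment returns the incremented value,
                // so every index from 1 to actions.Length is claimed exactly once.
                for (int k = Interlocked.Increment(ref nextIndex);
                     k <= actions.Length;
                     k = Interlocked.Increment(ref nextIndex))
                {
                    actions[k - 1]();
                }
            });
        }

        Task.WaitAll(workers);
    }
}
```

The point of the pattern is that the number of allocated tasks depends on the number of cores, not on the number of actions.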
So which is better? The best task is the task that is never run. If you need to create so many tasks that they become a burden on the garbage collector, you should stay away from the task APIs and stick to the Parallel class, which gives you direct parallel execution on all cores without new tasks.
If you need to go even faster, creating threads by hand and using hand-optimized data structures tuned to your access pattern might be the most performant solution. But it is unlikely that you will succeed, because the TPL and Parallel APIs are already heavily tuned. Usually it is enough to use one of the many overloads to configure your tasks or the Parallel class, which achieves the same with much less code.
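As one example of such an overload, `Parallel.Invoke` accepts a `ParallelOptions` instance that caps how many actions run at the same time. A minimal sketch; the wrapper name `Configured.Run` and its `maxParallelism` parameter are my own, not part of the framework.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper class, for illustration only.
static class Configured
{
    // maxParallelism must be positive, or -1 for "no limit".
    public static void Run(Action[] actions, int maxParallelism)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallelism };
        Parallel.Invoke(options, actions);
    }
}
```

Calling `Configured.Run(actions, 2)` would run the actions on at most two threads at once, which can help when the actions compete for a shared resource.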
But if you have a non-standard threading pattern, you might be better off without the TPL to get the most out of your cores. Even Stephen Toub mentioned that the TPL APIs were not designed for ultra-fast performance; the main goal was to make threading easier for the "average" programmer. To beat the TPL in specific cases you need to be well above average, and you need to know a lot about CPU cache lines, thread scheduling, memory models, JIT code generation, and so on to come up with something better for your specific scenario.