Can anyone suggest typical scenarios where the Partitioner
class introduced in .NET 4.0 can/should be used?
Range partitioning, as suggested by Brian Rasmussen, is one type of partitioning; it should be used when the work is CPU intensive, tends to be small (relative to the cost of a virtual method call), involves many elements, and takes a roughly constant amount of time per element.
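For example, Partitioner.Create(0, length) produces a range partition that can be handed to Parallel.ForEach, so each task receives one contiguous index range up front and never has to come back for more work. The array size and the per-element work below are just placeholders for illustration:

```
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class RangePartitionExample
{
    static void Main()
    {
        double[] data = new double[1000000];        // placeholder input
        double[] results = new double[data.Length];

        // Each tuple is a contiguous [fromInclusive, toExclusive) index range,
        // so every task gets a fixed slice of the array with no further locking.
        Parallel.ForEach(Partitioner.Create(0, data.Length), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                results[i] = Math.Sqrt(data[i]);    // small, roughly constant-cost CPU work
            }
        });
    }
}
```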
The other type of partitioning to consider is chunk partitioning. It is also described as a load-balancing approach because a worker thread will rarely sit idle while there is more work to do, which is not the case with a range partition.
A chunk partition should be used when the work involves some wait states, tends to require more processing per element, or when elements can have significantly different processing times.
One example of this might be reading 100 files of vastly different sizes into memory and processing them. A 1 KB file will be processed in much less time than a 1 MB file. If a range partition is used here, some threads could sit idle for a while because they happened to be assigned the smaller files.
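As a sketch of that scenario, passing loadBalance: true to Partitioner.Create requests a load-balancing chunk partition, so a thread that finishes its small files early simply grabs another chunk instead of sitting idle. The folder path and the ProcessFile method below are hypothetical stand-ins for the real per-file work:

```
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class ChunkPartitionExample
{
    static void Main()
    {
        string[] files = Directory.GetFiles(@"C:\SomeFolder");   // hypothetical folder

        // loadBalance: true asks for chunk (load-balancing) partitioning rather
        // than handing each thread a fixed slice of the array up front.
        Parallel.ForEach(Partitioner.Create(files, loadBalance: true), file =>
        {
            byte[] contents = File.ReadAllBytes(file);   // work varies with file size
            ProcessFile(file, contents);                 // hypothetical per-file work
        });
    }

    static void ProcessFile(string path, byte[] contents)
    {
        // placeholder for the real processing
    }
}
```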
Unlike with a range partition, there is no way to specify the number of elements to be processed by each task unless you write your own custom partitioner. Another downside to a chunk partition is that there may be some contention each time a worker goes back to get another chunk, because an exclusive lock is taken at that point. So a chunk partition clearly should not be used for short amounts of CPU-intensive work.
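A rough sketch of such a custom partitioner is below. It hands out fixed-size chunks from a shared enumerator under a lock, which also shows where the contention mentioned above comes from. The class name and design are made up for illustration, and error handling and enumerator disposal are omitted:

```
using System.Collections;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Illustrative chunk partitioner with a fixed, caller-chosen chunk size.
public class FixedChunkPartitioner<T> : Partitioner<T>
{
    private readonly IEnumerable<T> _source;
    private readonly int _chunkSize;

    public FixedChunkPartitioner(IEnumerable<T> source, int chunkSize)
    {
        _source = source;
        _chunkSize = chunkSize;
    }

    public override bool SupportsDynamicPartitions
    {
        get { return true; }
    }

    // Static partitioning: give each of the requested partitions an enumerator
    // over the same shared, load-balanced stream of chunks.
    public override IList<IEnumerator<T>> GetPartitions(int partitionCount)
    {
        IEnumerable<T> dynamicPartitions = GetDynamicPartitions();
        return Enumerable.Range(0, partitionCount)
                         .Select(i => dynamicPartitions.GetEnumerator())
                         .ToList();
    }

    public override IEnumerable<T> GetDynamicPartitions()
    {
        return new ChunkedPartitions(_source.GetEnumerator(), _chunkSize);
    }

    private sealed class ChunkedPartitions : IEnumerable<T>
    {
        private readonly IEnumerator<T> _shared;
        private readonly int _chunkSize;
        private readonly object _gate = new object();

        public ChunkedPartitions(IEnumerator<T> shared, int chunkSize)
        {
            _shared = shared;
            _chunkSize = chunkSize;
        }

        // Each worker takes a whole chunk under the lock (this is where the
        // contention mentioned above occurs), then processes it lock-free.
        public IEnumerator<T> GetEnumerator()
        {
            while (true)
            {
                var chunk = new List<T>(_chunkSize);
                lock (_gate)
                {
                    while (chunk.Count < _chunkSize && _shared.MoveNext())
                        chunk.Add(_shared.Current);
                }

                if (chunk.Count == 0)
                    yield break;

                foreach (T item in chunk)
                    yield return item;
            }
        }

        IEnumerator IEnumerable.GetEnumerator()
        {
            return GetEnumerator();
        }
    }
}
```

It could then be used like Parallel.ForEach(new FixedChunkPartitioner<string>(files, 10), file => ProcessFile(file)), where ProcessFile is again a hypothetical stand-in.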
The default chunk partitioner starts off with a chunk size of 1 element per chunk. After each thread has processed three 1-element chunks, the chunk size is incremented to 2 elements per chunk. After three 2-element chunks have been processed by each thread, the chunk size is incremented again to 3 elements per chunk, and so on. At least this is the way it works according to Dixin Yan (see the Chunk partitioning section of his blog), who works for Microsoft.
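Purely as a toy model of that schedule as described above (this is not the actual framework code), a single worker's chunk sizes would grow like this:

```
using System;

class ChunkGrowthSketch
{
    static void Main()
    {
        // Each chunk size is used for three chunks, then increased by one,
        // per the description attributed to Dixin Yan above.
        int chunkSize = 1;
        for (int chunkIndex = 0; chunkIndex < 12; chunkIndex++)
        {
            if (chunkIndex > 0 && chunkIndex % 3 == 0)
                chunkSize++;
            Console.WriteLine("Chunk {0}: {1} element(s)", chunkIndex + 1, chunkSize);
        }
        // Prints sizes 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4
    }
}
```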
By the way, the nice visualizer in his blog appears to be the Concurrency Visualizer profiling tool. The docs for this tool claim that it can be used to locate performance bottlenecks, CPU under-utilization, thread contention, cross-core thread migration, synchronization delays, DirectX activity, areas of overlapped I/O, and other information. It provides graphical, tabular, and textual data views that show the relationships between the threads in an app and the system as a whole.
Other resources:
MSDN: Custom Partitioners for PLINQ and TPL
Part 5: Parallel Programming - Optimizing PLINQ by Joseph Albahari