I’ve run into something strange about the effect of large memory allocations on the scalability of the .Net runtime. In my test application I create lots of strings in a tig
The effect of a memory allocator on application speedup is more closely related to the number of allocations than the amount allocated. It's also more influenced by the allocation latency (amount of time to complete a single allocation on a single thread), which in the case of the CLR is extremely fast due to the use of a bump-pointer allocator (see section 3.4.3).
Your question is asking why the actual speedup is sublinear, and to answer that you should certainly review Amdahl's Law.
Going back to the Notes on the CLR Garbage Collector, you can see that an allocation context belongs to a particular thread (section 3.4.1), which reduces (but does not eliminate) the amount of synchronization required during multi-threaded allocations. If you find that allocation is truly the weak point, I would suggest trying an object pool (possibly per-thread) to reduce the load on the collector. By reducing the sheer number of allocations, you'll reduce the number of times the collector has to run. However, this will also result in more objects making it to generation 2, which is the slowest to collect when it is needed.
Finally, Microsoft continues to improve the garbage collector in newer versions of the CLR, so you should target the most recent version you are able to (.NET 2 at a bare minimum).
You may want to look at this question of mine.
I ran into a similar problem that was due to the fact that the CLR performs inter-thread synchronization when allocating memory to avoid overlapping allocations. Now, with the server GC, the locking algorithm may be different - but something along those same lines may be affecting your code.
Your initial post is fundamentally flawed - you're assuming that a linear speedup is possible through parallel execution. It isn't, and never has been. See Amdahl's Law (yes, I know, Wikipedia, but it's easier than anything else).
Your code, viewed from the abstraction the CLR provides, appears to have no dependencies - however, as LBushkin pointed out, that isn't the case. As SuperMagic pointed out, the hardware itself implies dependencies between the threads of execution. This is true of just about any problem that can be parallelized - even with independent machines, with independent hardware, some portion of the problem usually requires some element of synchronization, and that synchronization prevents a linear speedup.
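To put rough numbers on Amdahl's Law as cited above, here is a minimal Python sketch. The function names and the sample parallel fractions are illustrative assumptions, not measurements from the original test; the formula itself is the standard one, speedup = 1 / (s + (1 - s) / n) for serial fraction s and n workers.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def serial_fraction_from_speedup(speedup: float, workers: int) -> float:
    """Invert Amdahl's Law: the serial fraction implied by a measured speedup."""
    return (workers / speedup - 1.0) / (workers - 1.0)


if __name__ == "__main__":
    # Hypothetical parallel fractions, chosen only to show the shape of the curve.
    for p in (0.90, 0.95, 0.99):
        print(f"parallel fraction {p:.2f}: "
              f"best case on 8 workers = {amdahl_speedup(p, 8):.2f}x")
```

Even if 95% of the work were perfectly parallel, 8 workers could give at most roughly a 5.9x speedup under this model. Going the other way, `serial_fraction_from_speedup` lets you plug in a measured speedup to estimate how much of the run is effectively serialized (by allocation locks, memory bandwidth, or anything else).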
The hardware you're running this on is not capable of linear scaling of multiple processes or threads.
You have a single memory bank; that's a bottleneck. (Multiple-channel memory may improve access, but not for more processes than you have memory banks; it seems the E5320 processor supports 1 to 4 memory channels.)
There is only one memory controller per physical CPU package (two in your case); that's a bottleneck.
There are 2 L2 caches per CPU package; that's a bottleneck. Cache coherency issues will happen if that cache is exhausted.
This doesn't even get to the OS/RTL/VM issues in managing process scheduling and memory management, which will also contribute to non-linear scaling.
I think you're getting pretty reasonable results. Significant speedup with multiple threads and at each increment to 8...
Truly, have you ever read anything to suggest that commodity multi-CPU hardware is capable of linear scaling of multiple processes/threads? I haven't.
Great question Luke! I'm very interested in the answer.
I suspect that you were not expecting linear scaling, but something better than a 39% variance.
NoBugz - Based on 280Z28's links, there would actually be a GC heap per core with GCMode=Server. There should also be a GC thread per heap. This shouldn't result in the concurrency issues you mention?
"I ran into a similar problem that was due to the fact that the CLR performs inter-thread synchronization when allocating memory to avoid overlapping allocations. Now, with the server GC, the locking algorithm may be different - but something along those same lines may be affecting your code."
LBushkin - I think that is the key question: does GCMode=Server still cause inter-thread locking when allocating memory? Does anyone know, or can it simply be explained by hardware limitations, as mentioned by SuperMagic?