I recently benchmarked the .NET 4 garbage collector, allocating intensively from several threads. When the allocated values were recorded in an array (so they stayed reachable), I observed no scalability: adding threads gave essentially no speedup.
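For concreteness, the benchmark was along these lines (a simplified C# reconstruction, not the exact code; the object sizes and counts are illustrative):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class AllocBenchmark
{
    // Each thread allocates small arrays in a tight loop and keeps them
    // reachable by writing them into its own slice of a shared array.
    static void Run(int threadCount, int allocsPerThread)
    {
        var survivors = new byte[threadCount * allocsPerThread][];
        var threads = new Thread[threadCount];
        int gen0Before = GC.CollectionCount(0);
        var sw = Stopwatch.StartNew();

        for (int t = 0; t < threadCount; t++)
        {
            int id = t;  // capture the loop variable for the closure
            threads[t] = new Thread(() =>
            {
                for (int i = 0; i < allocsPerThread; i++)
                    survivors[id * allocsPerThread + i] = new byte[32];
            });
            threads[t].Start();
        }
        foreach (var thread in threads) thread.Join();

        sw.Stop();
        Console.WriteLine("{0} thread(s): {1} ms, {2} gen0 collections",
                          threadCount, sw.ElapsedMilliseconds,
                          GC.CollectionCount(0) - gen0Before);
    }

    static void Main()
    {
        Run(1, 1000000);
        Run(2, 1000000);
        Run(4, 1000000);
    }
}
```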
I can hazard a couple of guesses as to what is happening.
(1) If you have a single thread and there are M bytes of space free in generation 0, then the GC will only run once M bytes have been allocated.
(2) If you have N threads and the GC divides generation 0 into M/N bytes of space per thread, the GC will end up running every time a thread allocates M/N bytes. The showstopper here is that the GC needs to "stop the world" (i.e., suspend all running threads) in order to mark references from the threads' root sets. This is not cheap. So not only will the GC run more often, it will also do more work on each collection; the back-of-the-envelope sketch below puts rough numbers on this.
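To put rough numbers on (1) and (2) — the budget and allocation figures below are made up for illustration, not measured .NET internals:

```csharp
using System;

class Gen0Arithmetic
{
    static void Main()
    {
        const double gen0Budget = 16e6;     // assumed shared gen0 budget M (~16 MB)
        const double perThreadAlloc = 1e9;  // each thread allocates ~1 GB

        for (int n = 1; n <= 8; n *= 2)
        {
            double totalAlloc  = n * perThreadAlloc;
            double collections = totalAlloc / gen0Budget; // GC fires each time M is exhausted
            double suspensions = collections * n;         // every collection suspends all n threads
            Console.WriteLine("{0} thread(s): ~{1:F0} collections, ~{2:F0} thread-suspensions",
                              n, collections, suspensions);
        }
    }
}
```

For a fixed per-thread workload, the number of collections grows linearly with the thread count, and the total suspend-and-scan work grows roughly quadratically.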
The other problem, of course, is that multi-threaded applications aren't typically very cache friendly, which can also put a significant dent in your performance.
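For example (a hypothetical illustration of the cache effect, unrelated to the GC itself): if threads write to adjacent elements of a shared array, every write drags the same cache line between cores even though no element is logically shared:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class FalseSharingDemo
{
    // Four threads each bump their own counter in a shared long[].
    // With stride 1 the counters share a cache line (false sharing);
    // with stride 16 (128 bytes apart) each counter gets its own line.
    // Only the timing matters here, not the final counter values.
    static long TimeIt(int stride)
    {
        const int iterations = 50000000;
        var counters = new long[4 * stride];
        var threads = new Thread[4];
        var sw = Stopwatch.StartNew();

        for (int t = 0; t < 4; t++)
        {
            int slot = t * stride;
            threads[t] = new Thread(() =>
            {
                for (int i = 0; i < iterations; i++)
                    counters[slot]++;
            });
            threads[t].Start();
        }
        foreach (var thread in threads) thread.Join();

        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        Console.WriteLine("adjacent counters: {0} ms", TimeIt(1));
        Console.WriteLine("padded counters:   {0} ms", TimeIt(16));
    }
}
```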
I don't think this is a .NET GC issue; rather, it's an issue with GC in general. A colleague once ran a simple "ping pong" benchmark, sending integer messages between two threads using SOAP. The benchmark ran twice as fast when the two threads were in separate processes, because memory allocation and management were completely decoupled!