I recently benchmarked the .NET 4 garbage collector, allocating intensively from several threads. When the allocated values were recorded in an array, I observed no scalability
Releases that are very quick, easy to see (straight at the root, by assigning nulls) and massive can trick the GC into being eager, and the whole idea of a cache-local heap is a nice dream :-) Even if you had fully separated thread-local heaps (which you don't), the handle-pointer table would still have to be fully volatile just to make it safe for general multi-CPU scenarios. Oh, and remember that there are many threads, the CPU cache is shared, and the kernel's needs take precedence, so it's not all just for you :-)
Also beware that a "heap" with double pointers has two parts: the block of memory to hand out and the handle-pointer table (so that blocks can be moved while your code always holds one address). Such a table is a critical but very lean process-level resource, and just about the only way to stress it is to flood it with massive quick releases, so you managed to do it :-))
In general the rule of GC is: leak :-) Not forever of course, but for as long as you can. Remember how people go around saying "don't force GC collections"? That's part of the story. Also, the "stop the world" collection is actually much more efficient than the "concurrent" one, and used to be known by the nicer name of cycle stealing or scheduler cooperation. Only the mark phase needs to freeze the scheduler, and on a server there's a burst of several threads doing it (N cores are idle anyway :-)). The only reason the concurrent one exists is that a stop-the-world collection can make real-time operations like playing videos jittery, just as a longer thread quantum does.
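If you want to see which of the two collection styles your process is actually running with, here is a minimal sketch. It assumes a plain console app; the class name `GcModeSketch` is made up for illustration. The flavour itself is normally chosen with the `<gcServer>` and `<gcConcurrent>` elements in the app config, so the code only reads the settings.

```csharp
using System;
using System.Runtime;

static class GcModeSketch
{
    static void Main()
    {
        // True when the server GC (a heap and a dedicated GC thread per core) is active.
        Console.WriteLine("Server GC: " + GCSettings.IsServerGC);

        // Interactive means concurrent collection is enabled,
        // Batch means it is disabled (blocking collections only).
        Console.WriteLine("Latency mode: " + GCSettings.LatencyMode);
    }
}
```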
So again, if you go competing with the infrastructure on short and frequent CPU bursts (small alloc, almost no work, quick release), the only thing you'll see and measure will be GC and JIT noise.
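To get a feel for how much of such a benchmark is just the collector, you can count the gen 0 collections that happen while the loop runs. This is a rough sketch, not a proper benchmark harness: `ChurnSketch` and `CountGen0Collections` are hypothetical names, and the 64-byte array is an arbitrary stand-in for "small alloc".

```csharp
using System;

static class ChurnSketch
{
    // Allocates many short-lived small arrays and returns how many
    // gen 0 collections ran in this process while doing so.
    internal static int CountGen0Collections(int allocations, out long checksum)
    {
        int before = GC.CollectionCount(0);
        checksum = 0;
        for (int i = 0; i < allocations; i++)
        {
            var tmp = new byte[64];   // small alloc
            tmp[0] = (byte)i;         // almost no work
            checksum += tmp[0];       // after this the array is garbage
        }
        return GC.CollectionCount(0) - before;
    }

    static void Main()
    {
        long checksum;
        int collections = CountGen0Collections(50000000, out checksum);
        Console.WriteLine("gen 0 collections: " + collections + " (checksum " + checksum + ")");
    }
}
```

The count is process-wide, so if other threads allocate at the same time their collections show up in it too.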
If this is for something real, i.e. not just experimenting, the best you can do is to use big value arrays on the stack (structs). They can't be forced onto the heap, they are as local as a local can get, and they are not subject to any backdoor moving => the cache has to love them :-) That may mean switching to "unsafe" mode, using normal pointers and maybe doing a bit of allocation on your own (if you need something simple like lists), but that's a small price to pay for kicking the GC out :-) Trying to force data into the cache also depends on keeping your stacks lean otherwise; remember that you are not alone. Giving your threads some work that's worth at least several quanta between releases may also help. The worst-case scenario would be if you alloc and release within a single quantum.
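A minimal sketch of the stack-buffer idea, assuming the project is compiled with `/unsafe`. The names `StackBufferSketch` and `Work`, the buffer size and the iteration count are all made up for illustration; the inner loops just stand in for whatever real work your threads do.

```csharp
using System;
using System.Threading;

static class StackBufferSketch
{
    // Hypothetical worker: fills a stack-allocated scratch buffer and sums it.
    // The buffer lives in this thread's stack frame, so the GC never sees it.
    internal static unsafe long Work(int iterations)
    {
        const int Size = 256;                 // keep it small, the thread stack is limited
        int* scratch = stackalloc int[Size];
        long total = 0;
        for (int it = 0; it < iterations; it++)
        {
            for (int i = 0; i < Size; i++)
                scratch[i] = i * i;
            for (int i = 0; i < Size; i++)
                total += scratch[i];
        }
        return total;
    }

    static void Main()
    {
        var threads = new Thread[Environment.ProcessorCount];
        var results = new long[threads.Length];
        for (int t = 0; t < threads.Length; t++)
        {
            int id = t;                       // copy so each closure gets its own index
            threads[t] = new Thread(() => results[id] = Work(100000));
            threads[t].Start();
        }
        foreach (var th in threads)
            th.Join();
        Console.WriteLine(results[0]);
    }
}
```

Nothing in `Work` allocates on the managed heap, so the only GC-visible objects here are the threads and the small results array created up front.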