While investigating the subtler consequences of generational garbage collectors on application performance, I hit a quite staggering discrepancy in the performance of a very basic operation – a simple write to a heap location – depending on whether the value written is primitive or a reference.
The microbenchmark
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 1, time = 1)
@Measurement(iterations = 3, time = 1)
@State(Scope.Thread)
@Threads(1)
@Fork(2)
public class Writing
{
    static final int TARGET_SIZE = 1024;

    static final int[] primitiveArray = new int[TARGET_SIZE];
    static final Object[] referenceArray = new Object[TARGET_SIZE];

    // Incremented on each call so the stored value is not a compile-time constant.
    int val = 1;

    @GenerateMicroBenchmark
    public void fillPrimitiveArray() {
        final int primitiveValue = val++;
        for (int i = 0; i < TARGET_SIZE; i++)
            primitiveArray[i] = primitiveValue;
    }

    @GenerateMicroBenchmark
    public void fillReferenceArray() {
        final Object referenceValue = new Object();
        for (int i = 0; i < TARGET_SIZE; i++)
            referenceArray[i] = referenceValue;
    }
}
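For completeness, the benchmarks can be launched programmatically through JMH's Runner API along these lines (a sketch; the include pattern and available options depend on the JMH version in use):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class WritingMain {
    public static void main(String[] args) throws Exception {
        Options opts = new OptionsBuilder()
                .include(".*Writing.*") // regex selecting the benchmark class above
                .build();
        new Runner(opts).run();
    }
}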
The results
Benchmark           Mode  Thr  Cnt  Sec     Mean  Mean error    Units
fillPrimitiveArray  avgt    1    6    1   87.891       1.610  nsec/op
fillReferenceArray  avgt    1    6    1  640.287       8.368  nsec/op
Since the whole loop is almost 8 times slower, the write itself is probably more than 10 times slower. What could possibly explain such a slowdown?
The speed of writing out the primitive array is more than 10 writes per nanosecond (1024 writes in about 88 ns works out to roughly 11.6 writes/ns). Perhaps I should ask the flip side of my question: what makes primitive writing so fast? (BTW, I've checked: the times scale linearly with array size.)
Note that this is all single-threaded; specifying @Threads(2) will increase both measurements, but the ratio will be similar.
A bit of background: the card table and the associated write barrier
An object in the Young Generation could happen to be reachable only from an object in the Old Generation. To avoid collecting live objects, the YG collector must know about any references that were written to the Old Generation area since the last YG collection. This is achieved with a sort of "dirty flag table", called the card table, which has one flag for each block of 512 bytes of heap.
The "ugly" part of the scheme comes when we realize that each and every write of a reference must be accompanied by a card table invariant-maintaining piece of code: the location in the card table which guards the address being written to must be marked as dirty. This piece of code is termed the write barrier.
In the generated machine code, this looks as follows:
lea edx, [edi+ebp*4+0x10] ; calculate the heap location to write
mov [edx], ebx ; write the value to the heap location
shr edx, 9 ; calculate the offset into the card table
mov [ecx+edx], ah ; mark the card table entry as dirty
And this is all it takes for the same high-level operation when the value written is primitive:
mov [edx+ebx*4+0x10], ebp
The write barrier appears to contribute "just" one more write, but my measurements show that it causes an order-of-magnitude slowdown. I can't explain this.
UseCondCardMark just makes it worse
There is a fairly obscure JVM flag which is supposed to avoid the card table write if the entry is already marked dirty. This matters primarily in some degenerate cases where heavy card table writing causes false sharing between threads via CPU caches.
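Conceptually, the flag turns the unconditional card store into a check-then-store, along these lines (again a sketch, reusing the names from the earlier pseudocode):

// Store only if the card isn't already dirty. This spares the cache line
// a redundant write, at the cost of an extra load and branch per store.
static void writeReferenceCondMark(Object[] array, int index, Object value) {
    array[index] = value;
    long addr = addressOf(array, index);
    int card = (int) ((addr >>> CARD_SHIFT) & (cardTable.length - 1));
    if (cardTable[card] != DIRTY)
        cardTable[card] = DIRTY;
}

Anyway, I tried with that flag on: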
with -XX:+UseCondCardMark:
Benchmark           Mode  Thr  Cnt  Sec      Mean  Mean error    Units
fillPrimitiveArray  avgt    1    6    1    89.913       3.586  nsec/op
fillReferenceArray  avgt    1    6    1  1504.123      12.130  nsec/op
Quoting the authoritative answer provided by Vladimir Kozlov on the hotspot-compiler-dev mailing list:
Hi Marko,

For primitive arrays we use handwritten assembler code which uses XMM registers as vectors for initialization. For object arrays we did not optimize it because it is not a common case. We can improve it similarly to what we did for arraycopy, but we decided to leave it for now.

Regards,
Vladimir
I have also wondered why the optimized code is not inlined, and got an answer to that as well:
The code is not small, so we decided not to inline it. Look at MacroAssembler::generate_fill() in macroAssembler_x86.cpp.
My original answer:
I missed an important bit in the machine code, apparently because I was looking at the On-Stack Replacement version of the compiled method instead of the one used for subsequent calls. It turns out that HotSpot was able to prove that my loop amounts to what a call to Arrays.fill would have done, and it replaced the entire loop with a call instruction to such code. I can't see that function's code, but it presumably uses every possible trick, such as SIMD instructions, to fill a block of memory with the same 32-bit value.
This gave me the idea to measure the actual Arrays.fill calls.
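The fill-based benchmark bodies would have looked along these lines (my reconstruction from the benchmark names below; the post does not show this code, and Arrays must be imported from java.util):

// Added alongside the original loops, which were renamed loopFill...
@GenerateMicroBenchmark
public void fillPrimitiveArray() {
    Arrays.fill(primitiveArray, val++);
}

@GenerateMicroBenchmark
public void fillReferenceArray() {
    Arrays.fill(referenceArray, new Object());
}

I was in for more surprises: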
Benchmark               Mode  Thr  Cnt  Sec     Mean  Mean error    Units
fillPrimitiveArray      avgt    1    5    2  155.343       1.318  nsec/op
fillReferenceArray      avgt    1    5    2  682.975      17.990  nsec/op
loopFillPrimitiveArray  avgt    1    5    2  156.114       0.523  nsec/op
loopFillReferenceArray  avgt    1    5    2  682.209       7.047  nsec/op
The results with a loop and with a call to fill are identical. If anything, this is even more confusing than the results which motivated the question. I would have at least expected fill to benefit from the same optimization ideas regardless of value type.
Source: https://stackoverflow.com/questions/21523958/what-can-explain-the-huge-performance-penalty-of-writing-a-reference-to-a-heap-l