System.arraycopy with constant length

I'm playing around with JMH ( http://openjdk.java.net/projects/code-tools/jmh/ ) and I just stumbled on a strange result.

I'm benchmarking ways to make a shallow copy of an array and I can observe the expected results (that looping through the array is a bad idea and that there is no significant difference between #clone(), System#arraycopy() and Arrays#copyOf(), performance-wise).

Except that System#arraycopy() is about one-quarter slower when the array's length is hard-coded. Wait, what? How can this be slower?

Does anyone have an idea of what could be the cause?

The results (throughput):

# JMH 1.11 (released 17 days ago)
# VM version: JDK 1.8.0_05, VM 25.5-b02
# VM invoker: /Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk/Contents/Home/jre/bin/java
# VM options: -Dfile.encoding=UTF-8 -Duser.country=FR -Duser.language=fr -Duser.variant
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time

Benchmark                                            Mode  Cnt         Score         Error  Units
ArrayCopyBenchmark.ArraysCopyOf                     thrpt   20  67100500,319 ±  455252,537  ops/s
ArrayCopyBenchmark.ArraysCopyOf_Class               thrpt   20  65246374,290 ±  976481,330  ops/s
ArrayCopyBenchmark.ArraysCopyOf_Class_ConstantSize  thrpt   20  65068143,162 ± 1597390,531  ops/s
ArrayCopyBenchmark.ArraysCopyOf_ConstantSize        thrpt   20  64463603,462 ±  953946,811  ops/s
ArrayCopyBenchmark.Clone                            thrpt   20  64837239,393 ±  834353,404  ops/s
ArrayCopyBenchmark.Loop                             thrpt   20  21070422,097 ±  112595,764  ops/s
ArrayCopyBenchmark.Loop_ConstantSize                thrpt   20  24458867,274 ±  181486,291  ops/s
ArrayCopyBenchmark.SystemArrayCopy                  thrpt   20  66688368,490 ±  582416,954  ops/s
ArrayCopyBenchmark.SystemArrayCopy_ConstantSize     thrpt   20  48992312,357 ±  298807,039  ops/s

And the benchmark class:

import java.util.Arrays;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public class ArrayCopyBenchmark {

    private static final int LENGTH = 32;

    private Object[] array;

    @Setup
    public void before() {
        array = new Object[LENGTH];
        for (int i = 0; i < LENGTH; i++) {
            array[i] = new Object();
        }
    }

    @Benchmark
    public Object[] Clone() {
        Object[] src = this.array;
        return src.clone();
    }

    @Benchmark
    public Object[] ArraysCopyOf() {
        Object[] src = this.array;
        return Arrays.copyOf(src, src.length);
    }

    @Benchmark
    public Object[] ArraysCopyOf_ConstantSize() {
        Object[] src = this.array;
        return Arrays.copyOf(src, LENGTH);
    }

    @Benchmark
    public Object[] ArraysCopyOf_Class() {
        Object[] src = this.array;
        return Arrays.copyOf(src, src.length, Object[].class);
    }

    @Benchmark
    public Object[] ArraysCopyOf_Class_ConstantSize() {
        Object[] src = this.array;
        return Arrays.copyOf(src, LENGTH, Object[].class);
    }

    @Benchmark
    public Object[] SystemArrayCopy() {
        Object[] src = this.array;
        int length = src.length;
        Object[] array = new Object[length];
        System.arraycopy(src, 0, array, 0, length);
        return array;
    }

    @Benchmark
    public Object[] SystemArrayCopy_ConstantSize() {
        Object[] src = this.array;
        Object[] array = new Object[LENGTH];
        System.arraycopy(src, 0, array, 0, LENGTH);
        return array;
    }

    @Benchmark
    public Object[] Loop() {
        Object[] src = this.array;
        int length = src.length;
        Object[] array = new Object[length];
        for (int i = 0; i < length; i++) {
            array[i] = src[i];
        }
        return array;
    }

    @Benchmark
    public Object[] Loop_ConstantSize() {
        Object[] src = this.array;
        Object[] array = new Object[LENGTH];
        for (int i = 0; i < LENGTH; i++) {
            array[i] = src[i];
        }
        return array;
    }
}

As usual, these kinds of questions are quickly answered by studying the generated code. JMH provides -prof perfasm on Linux and -prof xperfasm on Windows. If you run the benchmark on JDK 8u40, you will see (note I used -bm avgt -tu ns to make the scores more comprehensible):

Benchmark                         Mode  Cnt   Score   Error  Units
ACB.SystemArrayCopy               avgt   25  13.294 ± 0.052  ns/op
ACB.SystemArrayCopy_ConstantSize  avgt   25  16.413 ± 0.080  ns/op

Why do these benchmarks perform differently? Let's first use -prof perfnorm to dissect them (I dropped the lines that do not matter; ACB.SAC is short for ArrayCopyBenchmark.SystemArrayCopy):

Benchmark                                     Mode  Cnt    Score    Error  Units
ACB.SAC                                       avgt   25   13.466 ±  0.070  ns/op
ACB.SAC:·CPI                                  avgt    5    0.602 ±  0.025   #/op
ACB.SAC:·L1-dcache-load-misses                avgt    5    2.346 ±  0.239   #/op
ACB.SAC:·L1-dcache-loads                      avgt    5   24.756 ±  1.438   #/op
ACB.SAC:·L1-dcache-store-misses               avgt    5    2.404 ±  0.129   #/op
ACB.SAC:·L1-dcache-stores                     avgt    5   14.929 ±  0.230   #/op
ACB.SAC:·LLC-loads                            avgt    5    2.151 ±  0.217   #/op
ACB.SAC:·branches                             avgt    5   17.795 ±  1.003   #/op
ACB.SAC:·cycles                               avgt    5   56.677 ±  3.187   #/op
ACB.SAC:·instructions                         avgt    5   94.145 ±  6.442   #/op

ACB.SAC_ConstantSize                          avgt   25   16.447 ±  0.084  ns/op
ACB.SAC_ConstantSize:·CPI                     avgt    5    0.637 ±  0.016   #/op
ACB.SAC_ConstantSize:·L1-dcache-load-misses   avgt    5    2.357 ±  0.206   #/op
ACB.SAC_ConstantSize:·L1-dcache-loads         avgt    5   25.611 ±  1.482   #/op
ACB.SAC_ConstantSize:·L1-dcache-store-misses  avgt    5    2.368 ±  0.123   #/op
ACB.SAC_ConstantSize:·L1-dcache-stores        avgt    5   25.593 ±  1.610   #/op
ACB.SAC_ConstantSize:·LLC-loads               avgt    5    1.050 ±  0.038   #/op
ACB.SAC_ConstantSize:·branches                avgt    5   17.853 ±  0.697   #/op
ACB.SAC_ConstantSize:·cycles                  avgt    5   66.680 ±  2.049   #/op
ACB.SAC_ConstantSize:·instructions            avgt    5  104.759 ±  4.831   #/op
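
To compare the two blocks, it helps to subtract them counter by counter. The sketch below is not part of the original measurement; it only redoes that subtraction on the scores listed above, and the class name PerfnormDelta is made up for illustration.

```java
import java.util.Locale;

public class PerfnormDelta {

    /** Difference between the ConstantSize counter and the default counter. */
    public static double delta(double constantSize, double runtimeSize) {
        return constantSize - runtimeSize;
    }

    public static void main(String[] args) {
        // Scores copied from the two perfnorm blocks above (JDK 8u40 run).
        String[] names = {
            "ns/op", "L1-dcache-loads", "L1-dcache-stores",
            "LLC-loads", "branches", "cycles", "instructions"
        };
        double[] runtimeSize  = {13.466, 24.756, 14.929, 2.151, 17.795, 56.677,  94.145};
        double[] constantSize = {16.447, 25.611, 25.593, 1.050, 17.853, 66.680, 104.759};

        // Print one signed difference per counter (ConstantSize minus default).
        for (int i = 0; i < names.length; i++) {
            System.out.printf(Locale.ROOT, "%-18s %+8.3f%n",
                    names[i], delta(constantSize[i], runtimeSize[i]));
        }
    }
}
```

The L1-dcache-stores row stands out at about +10.7 per operation, well outside the reported error, while the loads and branches rows differ by less than their error margins.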

So, ConstantSize somehow does more L1-dcache-stores (about 25.6 versus 14.9 per operation), but one fewer LLC-load. Hm, so that's what we are looking for: more stores in the constant case. -prof perfasm conveniently highlights the hot parts in the assembly:

default:

  4.32%    6.36%   0x00007f7714bda2dc: movq   $0x1,(%rax)            ; alloc
  0.09%    0.04%   0x00007f7714bda2e3: prefetchnta 0x100(%r9)
  2.95%    1.48%   0x00007f7714bda2eb: movl   $0xf80022a9,0x8(%rax)
  0.38%    0.18%   0x00007f7714bda2f2: mov    %r11d,0xc(%rax)
  1.56%    3.02%   0x00007f7714bda2f6: prefetchnta 0x140(%r9)
  4.73%    2.71%   0x00007f7714bda2fe: prefetchnta 0x180(%r9)

ConstantSize:

  0.58%    1.22%   0x00007facf921132b: movq   $0x1,(%r14)            ; alloc
  0.84%    0.72%   0x00007facf9211332: prefetchnta 0xc0(%r10)
  0.11%    0.13%   0x00007facf921133a: movl   $0xf80022a9,0x8(%r14)
  0.21%    0.68%   0x00007facf9211342: prefetchnta 0x100(%r10)
  0.50%    0.87%   0x00007facf921134a: movl   $0x20,0xc(%r14)
  0.53%    0.82%   0x00007facf9211352: mov    $0x10,%ecx
  0.04%    0.14%   0x00007facf9211357: xor    %rax,%rax
  0.34%    0.76%   0x00007facf921135a: shl    $0x3,%rcx
  0.50%    1.17%   0x00007facf921135e: rex.W rep stos %al,%es:(%rdi) ; zeroing
 29.49%   52.09%   0x00007facf9211361: prefetchnta 0x140(%r10)
  1.03%    0.53%   0x00007facf9211369: prefetchnta 0x180(%r10)

So there is that pesky rex.W rep stos %al,%es:(%rdi) consuming a significant amount of time (the 29.49% shows up on the instruction right after it, most likely because of profiler skid). This zeroes the newly allocated array. In the ConstantSize test, the JVM could not work out that you are overwriting the entire target array, so it had to pre-zero it before diving into the actual array copy.
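
The size of that zeroing can be cross-checked against the ConstantSize listing. This is a rough sketch under two assumptions that the listing suggests but does not state: that the mov $0x10,%ecx / shl $0x3,%rcx pair sets up a byte count for rep stos, and that references are compressed to 4 bytes each. ZeroedBytes is a made-up name.

```java
public class ZeroedBytes {

    // Array length used by the benchmark (LENGTH in ArrayCopyBenchmark).
    static final int LENGTH = 32;

    // Assumption: compressed references, i.e. 4 bytes per array element.
    static final int ASSUMED_REF_BYTES = 4;

    /** Value left in %rcx by: mov $0x10,%ecx ; shl $0x3,%rcx. */
    public static long bytesCleared() {
        long rcx = 0x10;   // mov $0x10,%ecx
        rcx <<= 3;         // shl $0x3,%rcx
        return rcx;
    }

    /** Size of the element payload of a 32-element reference array. */
    public static long payloadBytes() {
        return (long) LENGTH * ASSUMED_REF_BYTES;
    }

    public static void main(String[] args) {
        System.out.println("cleared by rep stos: " + bytesCleared());
        System.out.println("array payload:       " + payloadBytes());
    }
}
```

Both values come out as 128, which is consistent with the whole 32-element payload being cleared before the copy overwrites it.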

If you look at the generated code on JDK 9b82 (the latest available), you will see that it folds both patterns into a non-zeroed copy, as you can see with -prof perfasm and confirm with -prof perfnorm:

Benchmark                                     Mode  Cnt    Score    Error  Units
ACB.SAC                                       avgt   50   14.156 ±  0.492  ns/op
ACB.SAC:·CPI                                  avgt    5    0.612 ±  0.144   #/op
ACB.SAC:·L1-dcache-load-misses                avgt    5    2.363 ±  0.341   #/op
ACB.SAC:·L1-dcache-loads                      avgt    5   28.350 ±  2.181   #/op
ACB.SAC:·L1-dcache-store-misses               avgt    5    2.287 ±  0.607   #/op
ACB.SAC:·L1-dcache-stores                     avgt    5   16.922 ±  3.402   #/op
ACB.SAC:·branches                             avgt    5   21.242 ±  5.914   #/op
ACB.SAC:·cycles                               avgt    5   67.168 ± 20.950   #/op
ACB.SAC:·instructions                         avgt    5  109.931 ± 35.905   #/op

ACB.SAC_ConstantSize                          avgt   50   13.763 ±  0.067  ns/op
ACB.SAC_ConstantSize:·CPI                     avgt    5    0.625 ±  0.024   #/op
ACB.SAC_ConstantSize:·L1-dcache-load-misses   avgt    5    2.376 ±  0.214   #/op
ACB.SAC_ConstantSize:·L1-dcache-loads         avgt    5   28.285 ±  2.127   #/op
ACB.SAC_ConstantSize:·L1-dcache-store-misses  avgt    5    2.335 ±  0.223   #/op
ACB.SAC_ConstantSize:·L1-dcache-stores        avgt    5   16.926 ±  1.467   #/op
ACB.SAC_ConstantSize:·branches                avgt    5   19.469 ±  0.869   #/op
ACB.SAC_ConstantSize:·cycles                  avgt    5   62.395 ±  3.898   #/op
ACB.SAC_ConstantSize:·instructions            avgt    5   99.891 ±  5.435   #/op
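
What makes skipping the zeroing safe is that both patterns overwrite every slot of the freshly allocated array before anyone can read it. Below is a minimal plain-Java sketch of the two patterns from the benchmark. FullOverwriteDemo and its method names are illustrative, and it shows only the semantics, not the timing, which needs JMH.

```java
import java.util.Arrays;

public class FullOverwriteDemo {

    private static final int LENGTH = 32;

    /** Allocation and copy both use the length read from the source array. */
    public static Object[] copyRuntimeLength(Object[] src) {
        int length = src.length;
        Object[] dst = new Object[length];
        System.arraycopy(src, 0, dst, 0, length);
        return dst;
    }

    /** Allocation and copy both use the compile-time constant. */
    public static Object[] copyConstantLength(Object[] src) {
        Object[] dst = new Object[LENGTH];
        System.arraycopy(src, 0, dst, 0, LENGTH);
        return dst;
    }

    public static void main(String[] args) {
        Object[] src = new Object[LENGTH];
        for (int i = 0; i < LENGTH; i++) {
            src[i] = new Object();
        }

        Object[] a = copyRuntimeLength(src);
        Object[] b = copyConstantLength(src);

        // No slot keeps the default null, so the pre-zeroing is never observed.
        System.out.println(Arrays.asList(a).contains(null)); // false
        System.out.println(Arrays.asList(b).contains(null)); // false

        // Both results are shallow copies holding the same references.
        System.out.println(Arrays.equals(a, b));             // true
    }
}
```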

Of course, all these nanobenchmarks for arraycopy are susceptible to weird alignment-induced performance differences in the vectorized copying stubs, but that's another (horror) story that I don't have the courage to tell.

Source: https://stackoverflow.com/questions/32834869/system-arraycopy-with-constant-length

Tags: java, arrays, microbenchmark, jmh