Why is if (variable1 % variable2 == 0) inefficient?

我寻月下人不归 2021-01-29 18:51

I am new to Java and was running some code last night, and this really bothered me. I was building a simple program to display every X outputs in a for loop, and I noticed a MA

4 answers
  • 2021-01-29 19:23

    You are measuring the OSR (on-stack replacement) stub.

    An OSR stub is a special version of a compiled method intended specifically for transferring execution from interpreted mode to compiled code while the method is running.

    OSR stubs are not as optimized as regular methods, because they need a frame layout compatible with the interpreted frame. I showed this already in the following answers: 1, 2, 3.

    A similar thing happens here, too. While the "inefficient code" is running a long loop, the method is compiled specially for on-stack replacement right inside the loop. The state is transferred from the interpreted frame to the OSR-compiled method, and this state includes the progressCheck local variable. At this point the JIT cannot replace the variable with the constant, and thus cannot apply certain optimizations like strength reduction.

    In particular, this means the JIT does not replace integer division with multiplication. (See Why does GCC use multiplication by a strange number in implementing integer division? for the asm trick from an ahead-of-time compiler, when the value is a compile-time constant after inlining / constant-propagation, if those optimizations are enabled. An integer literal right in the % expression also gets optimized by gcc -O0, similar to here where it's optimized by the JITer even in an OSR stub.)

    However, if you run the same method several times, the second and the subsequent runs will execute the regular (non-OSR) code, which is fully optimized. Here is a benchmark to prove the theory (benchmarked using JMH):

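    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;
    import org.openjdk.jmh.infra.Blackhole;
    
    @State(Scope.Benchmark)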
    public class Div {
    
        @Benchmark
        public void divConst(Blackhole blackhole) {
            long startNum = 0;
            long stopNum = 100000000L;
    
            for (long i = startNum; i <= stopNum; i++) {
                if (i % 50000 == 0) {
                    blackhole.consume(i);
                }
            }
        }
    
        @Benchmark
        public void divVar(Blackhole blackhole) {
            long startNum = 0;
            long stopNum = 100000000L;
            long progressCheck = 50000;
    
            for (long i = startNum; i <= stopNum; i++) {
                if (i % progressCheck == 0) {
                    blackhole.consume(i);
                }
            }
        }
    }
    

    And the results:

    # Benchmark: bench.Div.divConst
    
    # Run progress: 0,00% complete, ETA 00:00:16
    # Fork: 1 of 1
    # Warmup Iteration   1: 126,967 ms/op
    # Warmup Iteration   2: 105,660 ms/op
    # Warmup Iteration   3: 106,205 ms/op
    Iteration   1: 105,620 ms/op
    Iteration   2: 105,789 ms/op
    Iteration   3: 105,915 ms/op
    Iteration   4: 105,629 ms/op
    Iteration   5: 105,632 ms/op
    
    
    # Benchmark: bench.Div.divVar
    
    # Run progress: 50,00% complete, ETA 00:00:09
    # Fork: 1 of 1
    # Warmup Iteration   1: 844,708 ms/op          <-- much slower!
    # Warmup Iteration   2: 105,893 ms/op          <-- as fast as divConst
    # Warmup Iteration   3: 105,601 ms/op
    Iteration   1: 105,570 ms/op
    Iteration   2: 105,475 ms/op
    Iteration   3: 105,702 ms/op
    Iteration   4: 105,535 ms/op
    Iteration   5: 105,766 ms/op
    

    The very first iteration of divVar is indeed much slower, because of the inefficiently compiled OSR stub. But as soon as the method reruns from the beginning, the new unconstrained version is executed, which leverages all the available compiler optimizations.
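
    The same effect can be looked for without JMH by timing repeated calls in plain Java. The following is a minimal sketch, not the benchmark used above: the class name OsrDemo and the method countHits are made up for illustration, and the absolute timings depend on the JVM and hardware. If the explanation above holds, the first call, which starts in the interpreter and is switched to the OSR stub mid-loop, would be expected to be the slow one.

```java
// Hypothetical demo class; the names are illustrative and not from the original answer.
class OsrDemo {

    // Counts how many i in [0, stopNum] are multiples of progressCheck.
    // progressCheck is deliberately a local variable, as in divVar above.
    static long countHits(long stopNum) {
        long progressCheck = 50000;
        long hits = 0;
        for (long i = 0; i <= stopNum; i++) {
            if (i % progressCheck == 0) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Call the same method several times and time each call separately.
        for (int run = 1; run <= 5; run++) {
            long start = System.nanoTime();
            long hits = countHits(100_000_000L);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("run " + run + ": " + elapsedMs + " ms, hits = " + hits);
        }
    }
}
```

    Running it with -XX:+PrintCompilation should list the compilations of countHits; OSR compilations are marked with a % sign in that output.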

  • 2021-01-29 19:29

    Following up on @phuclv's comment, I checked the code generated by the JIT [1]. The results are as follows:

    for variable % 50000 (division by constant):

    mov     rax,29f16b11c6d1e109h
    imul    rbx
    mov     r10,rbx
    sar     r10,3fh
    sar     rdx,0dh
    sub     rdx,r10
    imul    r10,rdx,0c350h    ; <-- imul
    mov     r11,rbx
    sub     r11,r10
    test    r11,r11
    jne     1d707ad14a0h
    

    for variable % variable:

    mov     rax,r14
    mov     rdx,8000000000000000h
    cmp     rax,rdx
    jne     22ccce218edh
    xor     edx,edx
    cmp     rbx,0ffffffffffffffffh
    je      22ccce218f2h
    cqo
    idiv    rax,rbx           ; <-- idiv
    test    rdx,rdx
    jne     22ccce218c0h
    

    Because integer division (idiv) is typically much slower than multiplication (imul) on x86-64 CPUs, the last code snippet is less performant.
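
    To make the first listing easier to follow, here is a rough Java transcription of what that instruction sequence computes. It is only a sketch: MagicMod and mod50000 are hypothetical names, Math.multiplyHigh requires Java 9 or later, and the constant is copied from the disassembly (it equals ceil(2^77 / 50000)).

```java
// Hypothetical illustration of the multiply-and-shift sequence from the first listing.
class MagicMod {

    // Magic multiplier taken from the disassembly: ceil(2^77 / 50000).
    static final long MAGIC = 0x29f16b11c6d1e109L;

    // Computes x % 50000 without a division instruction.
    static long mod50000(long x) {
        long hi = Math.multiplyHigh(x, MAGIC); // imul rbx: high 64 bits of the 128-bit product
        long q = (hi >> 13) - (x >> 63);       // sar rdx,0dh, then subtract the sign word of x
        return x - q * 50000L;                 // imul by 0c350h (50000), then sub
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1L, 49_999L, 50_000L, 123_456_789L, -123_456_789L};
        for (long x : samples) {
            System.out.println(x + ": magic = " + mod50000(x) + ", built-in = " + (x % 50000L));
        }
    }
}
```

    The quotient is rounded toward zero, like Java's own / and % operators, which is why the sign word of x is subtracted after the shift.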

    Java version:

    java version "11" 2018-09-25
    Java(TM) SE Runtime Environment 18.9 (build 11+28)
    Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11+28, mixed mode)
    

    [1] VM options used: -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,src/java/Main.main

  • 2021-01-29 19:30

    I was also surprised by the performance of the code above. The difference comes down to how the compiler treats the declared variable. In the second (inefficient) example:

    for (long i = startNum; i <= stopNum; i++) {
        if (i % progressCheck == 0) {
            System.out.println(i);
        }
    }
    

    You are performing the modulus operation between two variables. Here, the compiler has to read stopNum and progressCheck from the memory locations assigned to them on every iteration, because they are variables and their values might change.

    That is why, after each iteration, the compiler goes back to the memory location to check the latest value of the variables. As a result, it was not able to generate efficient bytecode at compile time.

    In the first code example, you are performing the modulus operation between a variable and a constant numeric value, which is not going to change during execution, so the compiler does not need to read that value from memory. That is why the compiler was able to generate efficient bytecode. If you declare progressCheck as final or as static final, the compiler knows at compile time (or run time) that its value is not going to change, and it replaces progressCheck with 50000 in the code:

    for (long i = startNum; i <= stopNum; i++) {
        if (i % 50000 == 0) {
            System.out.println(i);
        }
    }
    

    Now this code looks just like the first (efficient) code example, so both versions will run efficiently. There will not be much difference in execution time between them.
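
    A small sketch of the final variants mentioned above, with hypothetical class and method names. Under the Java Language Specification, a final variable of primitive type that is initialized with a constant expression is a constant variable, and javac substitutes its value wherever it is used, so the three loops below should compile to equivalent bytecode. This can be checked with javap -c. How fast each variant then runs still depends on the JIT.

```java
// Hypothetical demo; the names are illustrative and not from the original answer.
class FinalConstantDemo {

    // A static final constant: javac inlines its value at each use.
    static final long PROGRESS_CHECK = 50000;

    // Modulus by a literal.
    static long countWithLiteral(long stopNum) {
        long hits = 0;
        for (long i = 0; i <= stopNum; i++) {
            if (i % 50000 == 0) hits++;
        }
        return hits;
    }

    // Modulus by a final local initialized with a constant expression.
    static long countWithFinalLocal(long stopNum) {
        final long progressCheck = 50000;
        long hits = 0;
        for (long i = 0; i <= stopNum; i++) {
            if (i % progressCheck == 0) hits++;
        }
        return hits;
    }

    // Modulus by the static final constant.
    static long countWithStaticFinal(long stopNum) {
        long hits = 0;
        for (long i = 0; i <= stopNum; i++) {
            if (i % PROGRESS_CHECK == 0) hits++;
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(countWithLiteral(1_000_000L));
        System.out.println(countWithFinalLocal(1_000_000L));
        System.out.println(countWithStaticFinal(1_000_000L));
    }
}
```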

  • 2021-01-29 19:47

    As others have noted, the general modulus operation requires a division to be done. In some cases, the compiler can replace the division with a multiplication. But both can be slow compared to addition/subtraction. Hence, the best performance can be expected from something along these lines:

    long progressCheck = 50000;
    
    long counter = progressCheck;
    
    for (long i = startNum; i <= stopNum; i++){
        if (--counter == 0) {
            System.out.println(i);
            counter = progressCheck;
        }
    }
    

    (As a minor optimization attempt we use a pre-decrement down-counter here, because on many architectures comparing to 0 immediately after an arithmetic operation costs exactly 0 instructions/CPU cycles: the ALU's flags are already set appropriately by the preceding operation. A decent optimizing compiler will, however, do that optimization automatically even if you write if (counter++ == 50000) { ... counter = 0; }.)

    Notice that often you don't really want/need modulus, because you know that your loop counter (i) or whatever is only ever incremented by 1, and you really don't care about the actual remainder the modulus will give you, just see if the incrementing-by-one counter hits some value.
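
    One detail to keep in mind: the counter loop above first fires after progressCheck iterations (at i = startNum + progressCheck - 1), not at the multiples of progressCheck that the modulus test reports. If the same i values are wanted, the counter's starting value can be aligned. The sketch below uses hypothetical names and assumes startNum >= 0 and progressCheck > 0.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo; the names are illustrative and not from the original answer.
class DownCounterDemo {

    // Reference version: collects every i in [startNum, stopNum] with i % progressCheck == 0.
    static List<Long> hitsWithModulus(long startNum, long stopNum, long progressCheck) {
        List<Long> hits = new ArrayList<>();
        for (long i = startNum; i <= stopNum; i++) {
            if (i % progressCheck == 0) {
                hits.add(i);
            }
        }
        return hits;
    }

    // Down-counter version, aligned so that it fires on the same i values.
    // Assumes startNum >= 0 and progressCheck > 0.
    static List<Long> hitsWithCounter(long startNum, long stopNum, long progressCheck) {
        List<Long> hits = new ArrayList<>();
        long rem = startNum % progressCheck;
        // Number of decrements until the first multiple of progressCheck is reached.
        long counter = (rem == 0) ? 1 : progressCheck - rem + 1;
        for (long i = startNum; i <= stopNum; i++) {
            if (--counter == 0) {
                hits.add(i);
                counter = progressCheck;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(hitsWithModulus(3, 100, 10));
        System.out.println(hitsWithCounter(3, 100, 10));
    }
}
```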

    Another 'trick' is to use power-of-two values/limits, e.g. progressCheck = 1024;. The modulus by a power of two can be calculated quickly via a bitwise AND, i.e. if ( (i & (1024-1)) == 0 ) {...}. This should be pretty fast too, and may on some architectures outperform the explicit counter above.
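
    A quick way to convince yourself that the mask test and the modulus test agree is to compare them over a range of values. This is a minimal sketch with hypothetical names. It only covers the "is it a multiple of 1024" check; the actual remainder of a negative number differs between % and &.

```java
// Hypothetical demo; the names are illustrative and not from the original answer.
class PowerOfTwoMask {

    static final long PROGRESS_CHECK = 1024; // must be a power of two for the mask trick

    // Test using the modulus operator.
    static boolean isMultipleMod(long i) {
        return i % PROGRESS_CHECK == 0;
    }

    // Test using a bitwise AND with (power of two - 1).
    static boolean isMultipleMask(long i) {
        return (i & (PROGRESS_CHECK - 1)) == 0;
    }

    public static void main(String[] args) {
        long mismatches = 0;
        for (long i = -1_000_000L; i <= 1_000_000L; i++) {
            if (isMultipleMod(i) != isMultipleMask(i)) {
                mismatches++;
            }
        }
        System.out.println("mismatches: " + mismatches);
    }
}
```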
