Why is a ternary operator with two constants faster than one with a variable?

清酒与你 2020-12-31 10:56

In Java, I have two different statements that produce the same result using ternary operators:

  1. num < 0 ? 0 : num;
  2. num * (num < 0 ? 0 : 1);
3 Answers
  • 2020-12-31 11:07

    First, let's rewrite the benchmark with JMH to avoid common benchmarking pitfalls.

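    import java.util.concurrent.ThreadLocalRandom;
    
    import org.openjdk.jmh.annotations.Benchmark;
    
    public class FloatCompare {
    
        // Straightforward version: returns 0 for negative inputs, num otherwise.
        @Benchmark
        public float cmp() {
            float num = ThreadLocalRandom.current().nextFloat() * 2 - 1;
            return num < 0 ? 0 : num;
        }
    
        // Multiplication version: the ternary picks an int constant (0 or 1).
        @Benchmark
        public float mul() {
            float num = ThreadLocalRandom.current().nextFloat() * 2 - 1;
            return num * (num < 0 ? 0 : 1);
        }
    }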
    

    JMH also shows the multiplication code to be roughly twice as fast:

    Benchmark         Mode  Cnt   Score   Error  Units
    FloatCompare.cmp  avgt    5  12,940 ± 0,166  ns/op
    FloatCompare.mul  avgt    5   6,182 ± 0,101  ns/op
    

    Now it's time to engage the perfasm profiler (built into JMH) to see the assembly produced by the JIT compiler. Here are the most important parts of the output (comments are mine):

    cmp method:

      5,65%  │││  0x0000000002e717d0: vxorps  xmm1,xmm1,xmm1  ; xmm1 := 0
      0,28%  │││  0x0000000002e717d4: vucomiss xmm1,xmm0      ; compare num < 0 ?
      4,25%  │╰│  0x0000000002e717d8: jbe     2e71720h        ; jump if num >= 0
      9,77%  │ ╰  0x0000000002e717de: jmp     2e71711h        ; jump if num < 0
    

    mul method:

      1,59%  ││  0x000000000321f90c: vxorps  xmm1,xmm1,xmm1    ; xmm1 := 0
      3,80%  ││  0x000000000321f910: mov     r11d,1h           ; r11d := 1
             ││  0x000000000321f916: xor     r8d,r8d           ; r8d := 0
             ││  0x000000000321f919: vucomiss xmm1,xmm0        ; compare num < 0 ?
      2,23%  ││  0x000000000321f91d: cmovnbe r11d,r8d          ; r11d := r8d if num < 0
      5,06%  ││  0x000000000321f921: vcvtsi2ss xmm1,xmm1,r11d  ; xmm1 := (float) r11d
      7,04%  ││  0x000000000321f926: vmulss  xmm0,xmm1,xmm0    ; multiply
    

    The key difference is that there are no jump instructions in the mul method. Instead, the conditional move instruction cmovnbe is used.

    cmov works with integer registers. Since the (num < 0 ? 0 : 1) expression uses integer constants on the right-hand side, the JIT is smart enough to emit a conditional move instead of a conditional jump.
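    As a rough illustration of the two expression forms (the class and method names below are made up for this sketch and are not part of the benchmark): the inner ternary in mul yields an int, which is widened to float before the multiplication, in line with the vcvtsi2ss / vmulss pair in the listing above. The two forms agree on the benchmark's input range, but they are not identical on every float.

```java
// Hypothetical helper class, for illustration only.
final class TernaryForms {
    // Same expression as the cmp benchmark body: the result type is float.
    static float cmp(float num) {
        return num < 0 ? 0 : num;
    }

    // Same expression as the mul benchmark body: the ternary selects an int
    // (0 or 1), which is converted to float before the multiplication.
    static float mul(float num) {
        return num * (num < 0 ? 0 : 1);
    }

    public static void main(String[] args) {
        float[] samples = {-1f, -0.25f, 0f, 0.25f, 0.999f, Float.NEGATIVE_INFINITY};
        for (float s : samples) {
            System.out.println(s + " -> cmp=" + cmp(s) + ", mul=" + mul(s));
        }
    }
}
```

    For negative finite inputs mul produces -0.0f rather than 0.0f (the two compare equal with ==), and for negative infinity it produces NaN, because infinity times zero is NaN. Neither case occurs in the benchmark, where num stays within [-1, 1).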

    In this benchmark, the conditional jump is very inefficient, since branch prediction often fails due to the random nature of the numbers. That's why the branchless code of the mul method appears faster.

    If we modify the benchmark so that one branch prevails over the other, e.g. by replacing

    ThreadLocalRandom.current().nextFloat() * 2 - 1
    

    with

    ThreadLocalRandom.current().nextFloat() * 2 - 0.1f
    

    then branch prediction will work better, and the cmp method will become as fast as mul:

    Benchmark         Mode  Cnt  Score   Error  Units
    FloatCompare.cmp  avgt    5  5,793 ± 0,045  ns/op
    FloatCompare.mul  avgt    5  5,764 ± 0,048  ns/op
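
    To see why the second input expression makes the branch predictable, here is a small sketch that counts how often num < 0 holds for each offset. It is an illustration under assumptions, not part of the benchmark: it uses a seeded java.util.Random instead of ThreadLocalRandom so that runs are repeatable, and the class name, seed and sample count are arbitrary.

```java
import java.util.Random;

// Hypothetical helper, for illustration only.
final class BranchMix {
    // Fraction of samples of (nextFloat() * 2 - offset) that are negative.
    static double negativeFraction(float offset, int samples, long seed) {
        Random rand = new Random(seed);
        int negatives = 0;
        for (int i = 0; i < samples; i++) {
            float num = rand.nextFloat() * 2 - offset;
            if (num < 0) {
                negatives++;
            }
        }
        return (double) negatives / samples;
    }

    public static void main(String[] args) {
        System.out.println("offset 1.0: " + negativeFraction(1f, 1_000_000, 42L));
        System.out.println("offset 0.1: " + negativeFraction(0.1f, 1_000_000, 42L));
    }
}
```

    Since nextFloat() is uniform on [0, 1), an offset of 1 puts num in [-1, 1), so about half of the values are negative and the branch is close to a coin flip. An offset of 0.1 puts num in [-0.1, 1.9), so only about 5% are negative and the predictor can guess "not negative" almost every time.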
    
  • 2020-12-31 11:07

    I have discovered what makes the second statement take longer, but I cannot explain why it happens, if that makes sense. That said, I do believe this should give some greater insight into the issue we have here.

    Before I explain my reasoning I'll just tell you my discoveries outright: This has nothing to do with returning a constant or a variable from a ternary operation. It has everything to do with returning an integer or a float from a ternary operation. It comes down to this: returning a float from a ternary operation is "significantly" slower than returning an integer.

    I cannot explain why, but that is the root cause at least.

    Here's my reasoning: I used the following code to create a small text document with results, very similar to your example code.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.Random;
    
    // The class name is arbitrary; the original snippet was only the method body.
    public class TernaryTiming {
        public static void main(String[] args) throws IOException {
            Random rand = new Random();
            final int intOne = 1;
            final int intZero = 0;
            final float floatOne = 1f;
            final float floatZero = 0f;
    
            final long startTime = System.nanoTime();
    
            float[] results = new float[100000000];
            for (int i = 0; i < 100000000; i++) {
                float num = (rand.nextFloat() * 2) - 1;
                // Uncomment one of the following lines for each run.
    //                results[i] = num < 0 ? 0 : num;
    //                results[i] = num * (num < 0 ? 0 : 1);
    
    //                results[i] = num < 0 ? 0 : 1;
    //                results[i] = (num < 0 ? 0 : 1);
    //                results[i] = (num < 0 ? 0 : num);
    //                results[i] = 1 * (num < 0 ? 0 : num);
    
    //                results[i] = num < 0 ? 0 : floatOne;
    //                results[i] = num < 0 ? 0 : 1f;
    //                results[i] = (num < 0 ? 0 : floatOne);
    //                results[i] = (num < 0 ? 0 : 1f);
    //                results[i] = (num < 0 ? 0 : 1);
    
    //                results[i] = (num < 0 ? 0f : 1f);
    //                results[i] = (num < 0 ? 0 : 1);
    //                results[i] = (num < 0 ? floatZero : floatOne);
    //                results[i] = (num < 0 ? intZero : intOne);
    
    //                results[i] = num < 0 ? intZero : intOne;
    
    //                results[i] = num * (num < 0 ? 0 : 1);
    //                results[i] = num * (num < 0 ? 0f : 1f);
    //                results[i] = num < 0 ? 0 : num;
            }
    
            final long endTime = System.nanoTime();
    
            // Append the elapsed nanoseconds to test.txt, creating the file if it does not exist yet.
            String str = (endTime - startTime) + "\n";
            System.out.println(str);
            Files.write(Paths.get("test.txt"), str.getBytes(),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }
    
    

    For reasons I won't go into now but you can read about here, I used nanoTime() instead of currentTimeMillis(). The last line just appends the resulting time value to a text document so I can easily add comments.
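
    As a minimal sketch of the measurement pattern (this is not the code that produced the numbers in this answer; the class name, helper name, array size and repetition count are all made up for illustration), elapsed time can be taken as the difference between two nanoTime() readings:

```java
import java.util.Random;

// Hypothetical timing helper, for illustration only.
final class NanoTimer {
    // Runs the task once and returns the elapsed time in nanoseconds.
    static long timeNanos(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        float[] results = new float[1_000_000];
        Runnable fill = () -> {
            Random rand = new Random(1L);
            for (int i = 0; i < results.length; i++) {
                float num = rand.nextFloat() * 2 - 1;
                results[i] = num < 0 ? 0 : num;
            }
        };
        // Repeat a few times: early runs include interpretation and JIT compilation time.
        for (int run = 0; run < 5; run++) {
            System.out.println("run " + run + ": " + timeNanos(fill) + " ns");
        }
    }
}
```

    Only differences between nanoTime() readings are meaningful; the absolute values have no fixed reference point.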

    Here's the final text document. It includes the entire process of how I came to this conclusion:

    
        num < 0 ? 0 : num       // standard "intuitive" operation
        1576953800
        1576153599
        1579074600
        1564152100
        1571285399
        
        num * (num < 0 ? 0 : 1)    // strange operation that is somehow faster
        1358461100
        1347008700
        1356969200
        1343784400
        1336910000
        
        // let's remove the multiplication and focus on the ternary operation
        
        num < 0 ? 0 : 1     // without the multiplication, it is actually slower...?
        1597369200
        1586133701
        1596085700
        1657377000
        1581246399
        
        (num < 0 ? 0 : 1)     // Weird, adding the brackets back speeds it up
        1797034199
        1294372700
        1301998000
        1286479500
        1326545900
        
        (num < 0 ? 0 : num)     // adding brackets to the original operation does NOT speed it up.
        1611220001
        1585651599
        1565149099
        1728256000
        1590789800
        
        1 * (num < 0 ? 0 : num)    // the speedup is not simply from multiplication
        1588769201
        1587232199
        1589958400
        1576397900
        1599809000
        
        // Let's leave the return value out of this now, we'll just return either 0 or 1.
        
        num < 0 ? 0 : floatOne  // returning 1f, but from a variable
        1522992400
        1590028200
        1605736200
        1578443700
        1625144700
        
        num < 0 ? 0 : 1f   // returning 1f as a constant
        1583525400
        1570701000
        1577192000
        1657662601
        1633414701
        
        // from the last 2 tests we can assume that returning a variable or returning a constant has no significant speed difference.
        // let's add the brackets back and see if that still holds up.
        
        (num < 0 ? 0 : floatOne)  // 1f as variable, but with ()
        1573152100
        1521046800
        1534993700
        1630885300
        1581605100
        
        (num < 0 ? 0 : 1f)  // 1f as constant, with ()
        1589591100
        1566956800
        1540122501
        1767168100
        1591344701
        // strangely this is not faster, where before it WAS. The only difference is that I now wrote 1f instead of 1.
        
        (num < 0 ? 0 : 1)  // let's replace 1f with 1 again, then.
        1277688700
        1284385000
        1291326300
        1307219500
        1307150100
        // the speedup is back!
        // It would seem the speedup comes from returning an integer rather than a float. (and also using brackets around the operation.. somehow)
        
        // Let's try to confirm this by replacing BOTH return values with floats, or integers.
        // We're also keeping the brackets around everything, since that appears to be required for the speedup
        
        (num < 0 ? 0f : 1f)
        1572555600
        1583899100
        1595343300
        1607957399
        1593920499
        
        (num < 0 ? 0 : 1)
        1389069400
        1296926500
        1282131801
        1283952900
        1284215401
        
        // looks promising, now let's try the same but with variables
        // final int intOne = 1;
        // final int intZero = 0;
        // final float floatOne = 1f;
        // final float floatZero = 0f;
        
        (num < 0 ? floatZero : floatOne)
        1596659301
        1600570100
        1540921200
        1582599101
        1596192400
        
        (num < 0 ? intZero : intOne)
        1280634300
        1300473900
        1304816100
        1285289801
        1286386900
        
        // from the looks of it, using a variable or constant makes no significant difference, it definitely has to do with the return type.
        
        // That said, this is still only noticeable when using brackets around the operation, without them the int operation is still slow:
        
        num < 0 ? intZero : intOne
        1567954899
        1565483600
        1593726301
        1652833999
        1545883500
        
        // lastly, let's add the multiplication with num back, knowing what we know now.
        
        num * (num < 0 ? 0 : 1)    // the original fast operation, note how it uses integer as return type.
        1379224900
        1333161000
        1350076300
        1337188501
        1397156600
        
        num * (num < 0 ? 0f : 1f)  // knowing what we know now, using floats should be slower again.
        1572278499
        1579003401
        1660701999
        1576237400
        1590275300
        // ...and it is.
        
        // Now let's take a look at the intuitive solution
        
        num < 0 ? 0 : num      // the variable num is of type float. returning a float from a ternary operation is slower than returning an int.
        1565419400
        1569075400
        1632352999
        1570062299
        1617906200
    
    

    This all still raises the question: why is a ternary operation that returns a float slower than one that returns an int? Both int and float are 32 bits wide. Floats are not particularly slow outside the ternary operation: multiplying the returned int by a float variable does not slow things down. I do not have the answer to that.

    As for why the brackets speed up the operation: I am no expert, but my guess is that it has to do with how the interpreter handles the code:

    results[i] = num < 0 ? 0 : 1;
    

    Here the interpreter sees that results is an array of type float and simply replaces the integers with floats as an "optimization", so that it doesn't have to convert between types.

    results[i] = (num < 0 ? 0 : 1);
    

    Here the brackets force the interpreter to compute everything within them before doing anything else, which results in an int. Only AFTER that is the result converted to a float so that it can fit in the array, and type conversion isn't slow at all.
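
    One part of this can be checked at the language level: the static type the compiler assigns to each ternary. The sketch below is an illustration only, with a made-up class and made-up overloads whose sole purpose is to reveal which type overload resolution sees. It says nothing about the bytecode or machine code that is eventually produced, so it does not settle the guess about the brackets.

```java
// Hypothetical demo class, for illustration only.
final class TernaryTypes {
    // The overload chosen at compile time reveals the static type of the argument.
    static String typeOf(int value) { return "int"; }
    static String typeOf(float value) { return "float"; }

    static String describe(float num) {
        return String.join(",",
                typeOf(num < 0 ? 0 : 1),      // both operands are int
                typeOf((num < 0 ? 0 : 1)),    // same expression in brackets
                typeOf(num < 0 ? 0 : 1f),     // int and float operands
                typeOf(num < 0 ? 0f : 1f),    // both operands are float
                typeOf(num < 0 ? 0 : num));   // int constant and float variable
    }

    public static void main(String[] args) {
        System.out.println(describe(0.5f));
    }
}
```

    The ternary has type int only when both result operands are int; as soon as one of them is a float, the whole expression is a float. Brackets around the expression leave its type unchanged. This matches the pattern in the timings above, where the fast variants were the ones with two int operands.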

    Again, I have no technical knowledge to back this up; it is only my educated guess.

    Hopefully this is a good enough answer. If not, it should at least point people with more technical knowledge than me in the right direction.

  • 2020-12-31 11:19

    I have not investigated the code generated by the Java compiler or the JIT compiler, but when writing compilers, I usually detect and optimize ternary operators that perform boolean-to-integer conversions: (num < 0 ? 0 : 1) converts the boolean value to one of two integer constants. In C this particular code could be rewritten as !(num < 0). This conversion can produce branchless code, which would beat the branching code generated for (num < 0 ? 0 : num) on modern CPUs, even with an additional multiplication opcode. Note, however, that it is rather easy to produce branchless code for (num < 0 ? 0 : num) too, but the Java compiler or the JIT compiler might not do so.
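
    As one possible illustration of a branch-free form of (num < 0 ? 0 : num), here is a hand-written bit-mask sketch. The class and method names are made up, and this is not what the Java compiler or the JIT is known to emit; whether it actually runs faster than the ternary is not something the sketch measures.

```java
// Hypothetical sketch, for illustration only.
final class BranchlessClamp {
    // Reference form from the question.
    static float ternary(float num) {
        return num < 0 ? 0 : num;
    }

    // Bit-mask form: the arithmetic shift copies the sign bit into all 32 bits,
    // and the inverted mask clears every bit of the value when the sign bit is set.
    static float masked(float num) {
        int bits = Float.floatToRawIntBits(num);
        int keep = ~(bits >> 31); // 0 if the sign bit is set, all ones otherwise
        return Float.intBitsToFloat(bits & keep);
    }

    public static void main(String[] args) {
        float[] samples = {-1f, -0.25f, 0f, 0.25f, 1.5f};
        for (float s : samples) {
            System.out.println(s + " -> ternary=" + ternary(s) + ", masked=" + masked(s));
        }
    }
}
```

    The two forms differ on inputs whose sign bit is set but which are not less than zero: masked turns -0.0f into 0.0f and maps a NaN with the sign bit set to 0.0f, while ternary returns those inputs unchanged.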
