Why does changing 0.1f to 0 slow down performance by 10x?

我在风中等你 2020-11-22 04:30

Why does this bit of code,

const float x[16] = {  1.1,   1.2,   1.3,     1.4,   1.5,   1.6,   1.7,   1.8,
                       1.9,   2.0,   2.1,     2.2,          


        
5 answers
  •  忘了有多久
    2020-11-22 04:47

    Using gcc and applying a diff to the generated assembly yields only this difference:

    73c68,69
    <   movss   LCPI1_0(%rip), %xmm1
    ---
    >   movabsq $0, %rcx
    >   cvtsi2ssq   %rcx, %xmm1
    81d76
    <   subss   %xmm1, %xmm0
    

    The cvtsi2ssq version is indeed 10 times slower.

    Apparently, the float version uses an XMM register loaded from memory, while the int version converts a real int value 0 to float using the cvtsi2ssq instruction, taking a lot of time. Passing -O3 to gcc doesn't help. (gcc version 4.2.1.)

    (Using double instead of float doesn't matter, except that it changes the cvtsi2ssq into a cvtsi2sdq.)

    Update

    Some extra tests show that the cvtsi2ssq instruction is not necessarily the cause. Once it is eliminated (using int ai=0; float a=ai; and using a instead of 0), the speed difference remains. So @Mysticial is right: the denormalized floats make the difference. This can be seen by testing values between 0 and 0.1f. The turning point in the above code is approximately at 0.00000000000000000000000000000001, where the loops suddenly take 10 times as long.
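
    To check whether a given value falls in the denormalized range, a small helper can compare its magnitude against FLT_MIN, the smallest normalized float. This is only a sketch: the name is_denormal is made up for illustration, and it assumes the default floating-point mode (no flush-to-zero).

```c
#include <float.h>

/* Hypothetical helper, not part of the original code.
   Returns 1 if v is a denormalized (subnormal) float: nonzero, but
   smaller in magnitude than FLT_MIN, the smallest normalized float.
   NaN and infinity yield 0 because the comparison with FLT_MIN fails. */
int is_denormal(float v) {
    float mag = v < 0.0f ? -v : v;
    return mag != 0.0f && mag < FLT_MIN;
}
```

    For example, is_denormal(FLT_MIN / 2.0f) returns 1, while is_denormal(0.1f) and is_denormal(0.0f) both return 0. The standard fpclassify macro from <math.h> reports the same category as FP_SUBNORMAL.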

    Update<<1

    A small visualisation of this interesting phenomenon:

    • Column 1: a float, divided by 2 for every iteration
    • Column 2: the binary representation of this float
    • Column 3: the time taken to sum this float 1e7 times

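    You can clearly see the exponent (within the last 9 bits: the bits are printed least-significant first, so the final bit is the sign and the 8 bits before it are the exponent) change to its lowest value when denormalization sets in. At that point, simple addition becomes 20 times slower.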

    0.000000000000000000000000000000000100000004670110: 10111100001101110010000011100000 45 ms
    0.000000000000000000000000000000000050000002335055: 10111100001101110010000101100000 43 ms
    0.000000000000000000000000000000000025000001167528: 10111100001101110010000001100000 43 ms
    0.000000000000000000000000000000000012500000583764: 10111100001101110010000110100000 42 ms
    0.000000000000000000000000000000000006250000291882: 10111100001101110010000010100000 48 ms
    0.000000000000000000000000000000000003125000145941: 10111100001101110010000100100000 43 ms
    0.000000000000000000000000000000000001562500072970: 10111100001101110010000000100000 42 ms
    0.000000000000000000000000000000000000781250036485: 10111100001101110010000111000000 42 ms
    0.000000000000000000000000000000000000390625018243: 10111100001101110010000011000000 42 ms
    0.000000000000000000000000000000000000195312509121: 10111100001101110010000101000000 43 ms
    0.000000000000000000000000000000000000097656254561: 10111100001101110010000001000000 42 ms
    0.000000000000000000000000000000000000048828127280: 10111100001101110010000110000000 44 ms
    0.000000000000000000000000000000000000024414063640: 10111100001101110010000010000000 42 ms
    0.000000000000000000000000000000000000012207031820: 10111100001101110010000100000000 42 ms
    0.000000000000000000000000000000000000006103515209: 01111000011011100100001000000000 789 ms
    0.000000000000000000000000000000000000003051757605: 11110000110111001000010000000000 788 ms
    0.000000000000000000000000000000000000001525879503: 00010001101110010000100000000000 788 ms
    0.000000000000000000000000000000000000000762939751: 00100011011100100001000000000000 795 ms
    0.000000000000000000000000000000000000000381469876: 01000110111001000010000000000000 896 ms
    0.000000000000000000000000000000000000000190734938: 10001101110010000100000000000000 813 ms
    0.000000000000000000000000000000000000000095366768: 00011011100100001000000000000000 798 ms
    0.000000000000000000000000000000000000000047683384: 00110111001000010000000000000000 791 ms
    0.000000000000000000000000000000000000000023841692: 01101110010000100000000000000000 802 ms
    0.000000000000000000000000000000000000000011920846: 11011100100001000000000000000000 809 ms
    0.000000000000000000000000000000000000000005961124: 01111001000010000000000000000000 795 ms
    0.000000000000000000000000000000000000000002980562: 11110010000100000000000000000000 835 ms
    0.000000000000000000000000000000000000000001490982: 00010100001000000000000000000000 864 ms
    0.000000000000000000000000000000000000000000745491: 00101000010000000000000000000000 915 ms
    0.000000000000000000000000000000000000000000372745: 01010000100000000000000000000000 918 ms
    0.000000000000000000000000000000000000000000186373: 10100001000000000000000000000000 881 ms
    0.000000000000000000000000000000000000000000092486: 01000010000000000000000000000000 857 ms
    0.000000000000000000000000000000000000000000046243: 10000100000000000000000000000000 861 ms
    0.000000000000000000000000000000000000000000022421: 00001000000000000000000000000000 855 ms
    0.000000000000000000000000000000000000000000011210: 00010000000000000000000000000000 887 ms
    0.000000000000000000000000000000000000000000005605: 00100000000000000000000000000000 799 ms
    0.000000000000000000000000000000000000000000002803: 01000000000000000000000000000000 828 ms
    0.000000000000000000000000000000000000000000001401: 10000000000000000000000000000000 815 ms
    0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 42 ms
    0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 42 ms
    0.000000000000000000000000000000000000000000000000: 00000000000000000000000000000000 44 ms
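
    The program that produced this table is not shown, so the following is only a guess at how such a measurement could be made. All function names are invented for illustration, and the timings will differ by CPU and compiler flags.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Write the 32 bits of v into out, least-significant bit first,
   which is the order the table above appears to use. */
void float_bits_lsb_first(float v, char out[33]) {
    uint32_t u;
    memcpy(&u, &v, sizeof u);            /* well-defined type punning */
    for (int i = 0; i < 32; ++i)
        out[i] = ((u >> i) & 1u) ? '1' : '0';
    out[32] = '\0';
}

/* CPU time in milliseconds for n additions of v.
   The volatile accumulator keeps the compiler from removing the loop. */
double time_sum_ms(float v, long n) {
    volatile float sum = 0.0f;
    clock_t start = clock();
    for (long i = 0; i < n; ++i)
        sum += v;
    return 1000.0 * (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Print one row per halving step: value, bit pattern, time. */
void print_halving_table(float start, int rows, long n) {
    char bits[33];
    float v = start;
    for (int r = 0; r < rows; ++r) {
        float_bits_lsb_first(v, bits);
        printf("%.48f: %s %.0f ms\n", v, bits, time_sum_ms(v, n));
        v /= 2.0f;                       /* eventually denormal, then 0 */
    }
}
```

    Calling print_halving_table(1e-34f, 40, 10000000L) from main would walk through roughly the same range of values as the table, from normalized floats through the denormalized range down to zero.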
    

    An equivalent discussion about ARM can be found in the Stack Overflow question "Denormalized floating point in Objective-C?".
