Numpy: What is special about division by 0.5?

-上瘾入骨i 2021-02-05 01:21

This answer by @Dunes states that, due to pipelining, there is (almost) no difference between floating-point multiplication and division. However, from my experience with other l

2 Answers
  • 2021-02-05 02:03

    At first I suspected that numpy is invoking BLAS, but at least on my machine (python 2.7.13, numpy 1.11.2, OpenBLAS), it doesn't, as a quick check with gdb reveals:

    > gdb --args python timing.py
    ...
    Size: 100000000.0
    ^C
    Thread 1 "python" received signal SIGINT, Interrupt.
    sse2_binary_scalar2_divide_DOUBLE (op=0x7fffb3aee010, ip1=0x7fffb3aee010, ip2=0x6fe2c0, n=100000000)
        at numpy/core/src/umath/simd.inc.src:491
    491 numpy/core/src/umath/simd.inc.src: No such file or directory.
    (gdb) disass
       ...
       0x00007fffe6ea6228 <+392>:   movapd (%rsi,%rax,8),%xmm0
       0x00007fffe6ea622d <+397>:   divpd  %xmm1,%xmm0
    => 0x00007fffe6ea6231 <+401>:   movapd %xmm0,(%rdi,%rax,8)
       ...
    (gdb) p $xmm1
    $1 = {..., v2_double = {0.5, 0.5}, ...}
    

    In fact, numpy is running exactly the same generic loop regardless of the constant used. So all timing differences are purely due to the CPU.

    Actually, division is an instruction with a highly variable execution time. The amount of work to be done depends on the bit patterns of the operands, and special cases can also be detected and sped up. According to these tables (whose accuracy I do not know), on your E5-2620 (a Sandy Bridge) DIVPD has a latency and an inverse throughput of 10-22 cycles, and MULPD has a latency of 10 cycles and an inverse throughput of 5 cycles.

    Now, as for A*2.0 being slower than A*=2.0: gdb shows that exactly the same function is being used for multiplication, except that now the output op differs from the first input ip1. So it has to be purely an artifact of the extra memory being drawn into cache, which slows down the non-in-place operation for the large input (even though MULPD is producing only 2*8/5 = 3.2 bytes of output per cycle!). When using the 1e4-sized buffers, everything fits in cache, so that doesn't have a significant effect, and other overheads mostly drown out the difference between A/=0.5 and A/=0.51.
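
The in-place versus not-in-place distinction can be checked directly from Python. The following is only a minimal sketch: the helper name `describe_buffers` is made up for illustration, and the `bytes_per_cycle` line simply redoes the arithmetic above with the quoted figures (two 8-byte doubles per instruction, one instruction every 5 cycles).

```python
import numpy as np

def describe_buffers(n=8):
    """Report whether each form of the multiplication reuses A's buffer."""
    A = np.random.rand(n)
    before = A.ctypes.data            # address of A's data buffer

    A *= 2.0                          # in-place: output buffer is the input buffer
    inplace_same = (A.ctypes.data == before)

    B = A * 2.0                       # not in-place: the result goes to a new array
    outofplace_shared = bool(np.shares_memory(A, B))

    return inplace_same, outofplace_shared

# Output bytes per cycle: 2 doubles * 8 bytes, one instruction per 5 cycles
bytes_per_cycle = 2 * 8 / 5.0
```

For the not-in-place form the CPU has to touch a second array of the same size, which is the extra memory traffic referred to above.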

    Still, there are lots of weird effects in those timings, so I plotted a graph (the code to generate it is below).

    I've plotted the size of the A array against the number of CPU cycles per DIVPD/MULPD/ADDPD instruction. I ran this on a 3.3GHz AMD FX-6100. The orange and red vertical lines mark the L2 and L3 cache sizes. The blue line is the supposed maximum throughput of DIVPD according to those tables, one instruction per 4.5 cycles (which seems dubious). As you can see, not even A+=2.0 gets anywhere near this, even when the "overhead" of performing a numpy operation falls close to zero. So there are about 24 cycles of overhead just for looping and for reading and writing 16 bytes to/from L2 cache! Pretty shocking; maybe the memory accesses aren't aligned.

    Lots of interesting effects to note:

    • For arrays below 30KB, the majority of the time is overhead in Python/numpy.
    • Multiplication and addition are the same speed (as given in Agner's tables).
    • The difference in speed between A/=0.5 and A/=0.51 drops towards the right of the graph; this is because when the time to read/write memory increases, it overlaps and masks some of the time taken to do the division. For that reason, A/=0.5, A*=2.0 and A+=2.0 become the same speed.
    • Comparing the maximum difference between A/=0.51, A/=0.5 and A+=2.0 suggests that division has a throughput of 4.5-44 cycles, which fails to match the 4.5-11 in Agner's table.
    • However, the difference between A/=0.5 and A/=0.51 mostly disappears when the numpy overhead gets large, although there is still a difference of a few cycles. This is hard to explain, because numpy overhead can't mask the time taken to do the division.
    • Operations which are not in-place (dashed lines) become incredibly slow when the array is much larger than the L3 cache size, but in-place operations don't. They require double the memory bandwidth to RAM, but I can't explain why they would be 20x slower!
    • The dashed lines diverge on the left. This is presumably because division and multiplication are handled by different numpy functions with different amounts of overhead.

    Unfortunately, on another machine with a different FPU speed, cache size, memory bandwidth, numpy version, etc., these curves could look quite different.

    My take-away from this is: chaining together multiple arithmetic operations with numpy is going to be many times slower than doing the same in Cython while iterating over the inputs just once, because there is no "sweet spot" at which the cost of the arithmetic operations dominates the other costs.
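
To make the structural difference concrete, here is a rough sketch of the two approaches. The function names `chained` and `single_pass` are invented for this example, and the plain Python loop only stands in for a compiled (e.g. Cython) loop: as written it is far slower than numpy, so it shows the shape of the computation, not its speed.

```python
import numpy as np

def chained(A):
    # Three separate numpy loops; each one reads and writes a full-size array
    return (A * 2.0 + 1.0) / 0.5

def single_pass(A):
    # One loop over the input, with all the arithmetic done per element.
    # Plain Python is used here only as a stand-in for a compiled loop.
    out = np.empty_like(A)
    for i in range(A.shape[0]):
        out[i] = (A[i] * 2.0 + 1.0) / 0.5
    return out
```

Both functions compute the same values; the difference is that `chained` streams the data through memory once per operation. The script used to produce the graph follows.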

    import numpy as np
    import timeit
    import matplotlib.pyplot as plt
    
    CPUHz = 3.3e9
    divpd_cycles = 4.5
    L2cachesize = 2*2**20
    L3cachesize = 8*2**20
    
    def timeit_command(command, pieces, size):
        # Best of several repeats; each repeat's time is the total for number=6 runs of the loop
        return min(timeit.repeat("for i in range(%d): %s" % (pieces, command),
                                 "import numpy; A = numpy.random.rand(%d)" % size, number = 6))
    
    def run():
        totaliterations = 1e7
    
        commands=["A/=0.5", "A/=0.51", "A/0.5", "A*=2.0", "A*2.0", "A+=2.0"]
        styles=['-', '-', '--', '-', '--', '-']
    
        def draw_graph(command, style, compute_overhead = False):
            sizes = []
            y = []
            for pieces in np.logspace(0, 5, 11):
                size = int(totaliterations / pieces)
                sizes.append(size * 8)  # 8 bytes per double
                time = timeit_command(command, pieces, (4 if compute_overhead else size))
                # Divide by 2 because SSE instructions process two doubles each
                cycles = time * CPUHz / (size * pieces / 2)
                y.append(cycles)
            if compute_overhead:
                command = "numpy overhead"
            # semilogx uses a base-10 logarithmic x axis by default
            plt.semilogx(sizes, y, style, label = command, linewidth = 2)
    
        plt.figure()
        for command, style in zip(commands, styles):
            print(command)
            draw_graph(command, style)
        # Plot overhead
        draw_graph("A+=1.0", '-', compute_overhead=True)
    
        plt.legend(loc = 'best', prop = {'size':9}, handlelength = 3)
        plt.xlabel('Array size in bytes')
        plt.ylabel('CPU cycles per SSE instruction')
    
        # Draw vertical and horizontal lines
        ymin, ymax = plt.ylim()
        plt.vlines(L2cachesize, ymin, ymax, color = 'orange', linewidth = 2)
        plt.vlines(L3cachesize, ymin, ymax, color = 'red', linewidth = 2)
        xmin, xmax = plt.xlim()
        plt.hlines(divpd_cycles, xmin, xmax, color = 'blue', linewidth = 2)
        plt.show()

    run()
    
  • 2021-02-05 02:03

    Intel CPUs have special optimizations when dividing by powers of two. See, for example, http://www.agner.org/optimize/instruction_tables.pdf, where it states

    FDIV latency depends on precision specified in control word: 64 bits precision gives latency 38, 53 bits precision gives latency 32, 24 bits precision gives latency 18. Division by a power of 2 takes 9 clocks.

    Although this applies to FDIV and not DIVPD (as @RalphVersteegen's answer notes), it would be quite surprising if DIVPD did not also implement this optimization.


    Division is normally a very slow affair. However, a division by a power of two is just an exponent shift, and the mantissa usually doesn't need to change. This makes the operation very fast. Furthermore, it's easy to detect a power of two in floating-point representation as the mantissa will be all zeros (with an implicit leading 1), so this optimization is both easy to test for and cheap to implement.
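
The bit-level claim can be checked from Python. This is a small sketch that only inspects the IEEE 754 double encoding; it says nothing about what a particular CPU actually does internally, and the helper names `double_fields` and `is_power_of_two` are made up for the example.

```python
import struct

def double_fields(x):
    """Split an IEEE 754 double into (sign, biased exponent, 52-bit mantissa)."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    return sign, exponent, mantissa

def is_power_of_two(x):
    """True for normal doubles of the form +/-2**k (stored mantissa bits all zero)."""
    _, exponent, mantissa = double_fields(x)
    # Exponent 0 (zero/subnormals) and 0x7FF (inf/NaN) are excluded
    return mantissa == 0 and 0 < exponent < 0x7FF
```

With these helpers, `is_power_of_two(0.5)` is true and `is_power_of_two(0.51)` is false, and comparing `double_fields(3.0)` with `double_fields(3.0 / 0.5)` shows the same mantissa with the exponent field increased by one.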
