Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux

深忆病人 2021-01-02 02:08

I have a piece of code that runs 2x faster on Windows than on Linux. Here are the times I measured:

g++ -Ofast -march=native -m64
    29.1123
g++ -Ofast -mar         


        
3 Answers
  • 2021-01-02 02:35

    size_t is a 64-bit unsigned type in the x86-64 System V ABI on Linux, where you're compiling a 64-bit binary. But in a 32-bit binary (like you're making on Windows), it's only 32-bit, and thus the trial-division loop is only doing 32-bit division. (size_t is for sizes of C++ objects, not files, so it only needs to be pointer width.)
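
    A quick way to check this on a given toolchain is to look at the widths directly. The following is a minimal sketch; the helper names size_t_bits and pointer_bits are made up for illustration, and it assumes a mainstream flat-memory target where size_t is pointer-width (the C++ standard does not strictly require that).

```cpp
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::size_t

// Width of size_t in bits for the target being compiled for.
constexpr unsigned size_t_bits() {
    return static_cast<unsigned>(sizeof(std::size_t) * CHAR_BIT);
}

// Width of a data pointer in bits for the same target.
constexpr unsigned pointer_bits() {
    return static_cast<unsigned>(sizeof(void*) * CHAR_BIT);
}
```

    On a typical x86 Linux toolchain with multilib support installed, building with g++ -m32 and with g++ -m64 and printing size_t_bits() should show the 32 vs. 64 difference that explains the timings.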

    On x86-64 Linux, -m64 is the default, because 32-bit is basically considered obsolete. To make a 32-bit executable, use g++ -m32.


    Unlike most integer operations, division throughput (and latency) on modern x86 CPUs depends on the operand-size: 64-bit division is slower than 32-bit division. (See https://agner.org/optimize/ for tables of instruction throughput, latency, uops, and which ports they run on.)

    And it's very slow compared to other operations like multiply or especially add: your program completely bottlenecks on integer division throughput, not on the map operations. (With perf counters for a 32-bit binary on Skylake, arith.divider_active counts 24.03 billion cycles that the divide execution unit was active, out of 24.84 billion core clock cycles total. Yes that's right, division is so slow that there's a performance counter just for that execution unit. It's also a special case because it's not fully pipelined, so even in a case like this where you have independent divisions, it can't start a new one every clock cycle like it can for other multi-cycle operations like FP or integer multiply.)

    g++ unfortunately fails to optimize based on the fact that the numbers are compile-time constants and thus have limited ranges. It would be legal (and a huge speedup) for g++ -m64 to optimize to div ecx instead of div rcx. That change makes the 64-bit binary run as fast as the 32-bit binary. (It's computing exactly the same thing, just without as many high zero bits. The result is implicitly zero-extended to fill the 64-bit register, instead of explicitly calculated as zero by the divider, and that's much faster in this case.)
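
    To illustrate why the narrower divide computes the same thing, here is a minimal sketch (rem64 and rem_narrow are hypothetical names, not from the question's code). It assumes both operands fit in 32 bits and the divisor is non-zero. On x86-64, compilers typically emit div r64 for the first function and a 32-bit div for the second.

```cpp
#include <cstdint>

// Remainder with 64-bit operands: typically a `div r64` on x86-64.
constexpr std::uint64_t rem64(std::uint64_t n, std::uint64_t d) {
    return n % d;
}

// Same remainder via 32-bit operands: typically a `div r32`.
// Only valid when n and d both fit in 32 bits (assumption of this sketch);
// the result is zero-extended back to 64 bits on return.
constexpr std::uint64_t rem_narrow(std::uint64_t n, std::uint64_t d) {
    return static_cast<std::uint32_t>(n) % static_cast<std::uint32_t>(d);
}
```

    For any pair of in-range values the two functions agree, which is exactly why the hand-edited binary still produces the same output.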

    I verified this on Skylake by editing the binary to replace the 0x48 REX.W prefix with 0x40, changing div rcx into div ecx with a do-nothing REX prefix. The total cycles taken was within 1% of the 32-bit binary from g++ -O3 -m32 -march=native. (And time, since the CPU happened to be running at the same clock speed for both runs.) (g++7.3 asm output on the Godbolt compiler explorer.)

    32-bit code, gcc7.3 -O3 on a 3.9GHz Skylake i7-6700k running Linux

    $ cat > primes.cpp     # and paste your code, then edit to remove the silly system("pause")
    $ g++ -Ofast -march=native -m32 primes.cpp -o prime32
    
    $ taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,arith.divider_active  ./prime32 
    Serial time = 6.37695
    
    
     Performance counter stats for './prime32':
           6377.915381      task-clock (msec)         #    1.000 CPUs utilized          
                    66      context-switches          #    0.010 K/sec                  
                     0      cpu-migrations            #    0.000 K/sec                  
                   111      page-faults               #    0.017 K/sec                  
        24,843,147,246      cycles                    #    3.895 GHz                    
         6,209,323,281      branches                  #  973.566 M/sec                  
        24,846,631,255      instructions              #    1.00  insn per cycle         
        49,663,976,413      uops_issued.any           # 7786.867 M/sec                  
        40,368,420,246      uops_executed.thread      # 6329.407 M/sec                  
        24,026,890,696      arith.divider_active      # 3767.201 M/sec                  
    
           6.378365398 seconds time elapsed
    

    vs. 64-bit with REX.W=0 (hand-edited binary)

     Performance counter stats for './prime64.div32':
    
           6399.385863      task-clock (msec)         #    1.000 CPUs utilized          
                    69      context-switches          #    0.011 K/sec                  
                     0      cpu-migrations            #    0.000 K/sec                  
                   146      page-faults               #    0.023 K/sec                  
        24,938,804,081      cycles                    #    3.897 GHz                    
         6,209,114,782      branches                  #  970.267 M/sec                  
        24,845,723,992      instructions              #    1.00  insn per cycle         
        49,662,777,865      uops_issued.any           # 7760.554 M/sec                  
        40,366,734,518      uops_executed.thread      # 6307.908 M/sec                  
        24,045,288,378      arith.divider_active      # 3757.437 M/sec                  
    
           6.399836443 seconds time elapsed
    

    vs. the original 64-bit binary:

    $ g++ -Ofast -march=native primes.cpp -o prime64
    $ taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,arith.divider_active  ./prime64
    Serial time = 20.1916
    
     Performance counter stats for './prime64':
    
          20193.891072      task-clock (msec)         #    1.000 CPUs utilized          
                    48      context-switches          #    0.002 K/sec                  
                     0      cpu-migrations            #    0.000 K/sec                  
                   148      page-faults               #    0.007 K/sec                  
        78,733,701,858      cycles                    #    3.899 GHz                    
         6,225,969,960      branches                  #  308.310 M/sec                  
        24,930,415,081      instructions              #    0.32  insn per cycle         
       127,285,602,089      uops_issued.any           # 6303.174 M/sec                  
       111,797,662,287      uops_executed.thread      # 5536.212 M/sec                  
        27,904,367,637      arith.divider_active      # 1381.822 M/sec                  
    
          20.193208642 seconds time elapsed
    

    IDK why the performance counter for arith.divider_active didn't go up more. div r64 is significantly more uops than div r32, so possibly it hurts out-of-order execution and reduces overlap of surrounding code. But we know that back-to-back div with no other instructions has a similar performance difference.

    And anyway, this code spends most of its time in that terrible trial-division loop (which checks every odd and even divisor, even though we can already rule out all the even divisors after checking the low bit... And which checks all the way up to num instead of sqrt(num), so it's horribly slow for very large primes.)
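
    A minimal sketch of a less wasteful trial-division loop, applying both points above: handle even numbers with one check, then test only odd divisors up to sqrt(n). The name is_prime_u32 is made up for illustration, this is not the asker's IsPrime, and it assumes C++14 or later (for a loop inside a constexpr function).

```cpp
#include <cstdint>

// Trial division over odd divisors only, stopping at sqrt(n).
constexpr bool is_prime_u32(std::uint32_t n) {
    if (n < 2) return false;
    if (n % 2 == 0) return n == 2;   // the low bit rules out every even divisor
    // `d <= n / d` is equivalent to `d * d <= n` but cannot overflow.
    for (std::uint32_t d = 3; d <= n / d; d += 2) {
        if (n % d == 0) return false;
    }
    return true;
}
```

    The quotient and remainder use the same operands, so a compiler can usually get both from a single div instruction. For a prime near 2^32 this does roughly 32 thousand divisions instead of about 4 billion.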

    According to perf record, 99.98% of the cpu cycles events fired in the 2nd trial-division loop, the one for MaxNum - i, so div was still the entire bottleneck, and it's just a quirk of performance counters that not all the time was recorded as arith.divider_active.

      3.92 │1e8:   mov    rax,rbp
      0.02 │       xor    edx,edx
     95.99 │       div    rcx
      0.05 │       test   rdx,rdx 
           │     ↓ je     238     
      ... loop counter logic to increment rcx
    

    From Agner Fog's instruction tables for Skylake:

               uops    uops      ports          latency     recip tput
               fused   unfused
    DIV r32     10     10       p0 p1 p5 p6     26           6
    DIV r64     36     36       p0 p1 p5 p6     35-88        21-83
    

    (The performance of div r64 is itself data-dependent on the magnitude of its inputs, with small inputs being faster. The really slow cases are with very large quotients, IIRC. And probably also slower when the upper half of the 128-bit dividend in RDX:RAX is non-zero. C compilers typically only ever use div with rdx=0.)

    The ratio of the cycle counts (78733701858 / 24938804081 = ~3.16) is actually smaller than the ratio of best-case throughputs (21/6 = 3.5). It should be a pure throughput bottleneck, not latency, because the next loop iteration can start without waiting for the last division result. (Thanks to branch prediction + speculative execution.) Maybe there are some branch misses in that division loop.

    If you only found a 2x performance ratio, then you have a different CPU. Possibly Haswell, where 32-bit div throughput is 9-11 cycles, and 64-bit div throughput is 21-74.

    Probably not AMD: the best-case throughputs there are still small even for div r64. e.g. Steamroller has div r32 throughput = 1 per 13-39 cycles, and div r64 = 13-70. I'd guess that with the same actual numbers, you'd probably get the same performance even if you give them to the divider in wider registers, unlike Intel. (The worst-case goes up because the possible size of input and result is larger.) AMD integer division is only 2 uops, unlike Intel's which is microcoded as 10 or 36 uops on Skylake. (And even more for signed idiv r64 at 57 uops.) This is probably related to AMD being efficient for small numbers in wide registers.

    BTW, FP division is always single-uop, because it's more performance-critical in normal code. (Hint: nobody uses totally naive trial-division in real life for checking multiple primes if they care about performance at all. Sieve or something.)


    The key for the ordered map is a size_t, and pointers are larger in 64-bit code, making each red-black tree node significantly larger, but that's not the bottleneck.

    BTW, map<> is a terrible choice here vs. two arrays of bool prime_low[Count], prime_high[Count]: one for the low Count elements and one for the high Count. You have 2 contiguous ranges, so the key can be implicit by position. Or at least use a std::unordered_map hash table. I feel like the ordered version should have been called ordered_map, and map = unordered_map, because you often see code using map without taking advantage of the ordering.

    You could even use a std::vector<bool> to get a bitmap, using 1/8th the cache footprint.
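
    A rough sketch of that layout, with the key implied by position in one of two bitmaps. PrimeTable and its members are hypothetical names; count and max_num stand in for the Count and MaxNum discussed above. It assumes the keys are 0..count-1 and max_num-count+1..max_num, that max_num fits in 32 bits, and that the two ranges don't overlap.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical replacement for a map<size_t, bool> over two contiguous key ranges.
class PrimeTable {
public:
    PrimeTable(std::uint32_t count, std::uint32_t max_num)
        : count_(count), max_num_(max_num), low_(count, false), high_(count, false) {
        assert(count <= max_num / 2);   // low and high ranges must not overlap
    }

    void set(std::uint32_t key, bool value) {
        if (key < count_) {
            low_[key] = value;                    // low range: index is the key itself
        } else {
            assert(max_num_ - key < count_);      // key must lie in the high range
            high_[max_num_ - key] = value;        // high range: index is distance from max_num
        }
    }

    bool get(std::uint32_t key) const {
        if (key < count_) return low_[key];
        assert(max_num_ - key < count_);
        return high_[max_num_ - key];
    }

private:
    std::uint32_t count_;
    std::uint32_t max_num_;
    std::vector<bool> low_;    // one bit per key in [0, count)
    std::vector<bool> high_;   // one bit per key in (max_num - count, max_num]
};
```

    Lookups become a compare, a subtract and a bit test, with no pointer chasing and no per-node allocation.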

    There is an "x32" ABI (32-bit pointers in long mode) which has the best of both worlds for processes that don't need more than 4G of virtual address space: small pointers for higher data density / smaller cache footprint in pointer-heavy data structures, but the advantages of a modern calling convention, more registers, baseline SSE2, and 64-bit integer registers for when you do need 64-bit math. But unfortunately it's not very popular. It's only a little faster, so most people don't want a third version of every library.

    In this case, you could fix the source to use unsigned int (or uint32_t if you want to be portable to systems where int is only 16 bit). Or uint_least32_t to avoid requiring a fixed-width type. You could do this only for the arg to IsPrime, or for the data structure as well. (But if you're optimizing, the key is implicit by position in an array, not explicit.)

    You could even make a version of IsPrime with a 64-bit loop and a 32-bit loop, which selects based on the size of the input.
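
    One way that dispatch could look, as a sketch rather than a drop-in for the asker's IsPrime: a loop templated on the operand type, plus a wrapper that picks the 32-bit instantiation whenever the input fits. The names is_prime_impl and is_prime are made up for illustration, the loop is the odd-divisors-up-to-sqrt variant rather than the original check-everything loop, and C++14 or later is assumed.

```cpp
#include <cstdint>
#include <limits>

// Trial division with all arithmetic done in type T, so the compiler
// can pick the divide width from T.
template <typename T>
constexpr bool is_prime_impl(T n) {
    if (n < 2) return false;
    if (n % 2 == 0) return n == 2;
    for (T d = 3; d <= n / d; d += 2) {   // d <= n / d avoids overflow in d * d
        if (n % d == 0) return false;
    }
    return true;
}

// Select the narrow loop when the value fits in 32 bits.
constexpr bool is_prime(std::uint64_t n) {
    if (n <= std::numeric_limits<std::uint32_t>::max()) {
        return is_prime_impl<std::uint32_t>(static_cast<std::uint32_t>(n));
    }
    return is_prime_impl<std::uint64_t>(n);
}
```

    The branch is taken once per call, so its cost is negligible next to the divisions it saves from running at 64-bit width.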

  • 2021-01-02 02:41

    You don't say whether the Windows/Linux operating systems are 32 or 64 bit.

    On a 64-bit Linux machine, if you change the size_t to an int you'll find that execution times drop on Linux to a similar value to those that you have for Windows.

    size_t is a 32-bit unsigned integer on Win32 and a 64-bit unsigned integer on Win64.

    EDIT: just seen your Windows disassembly.

    Your Windows OS is the 32-bit variety (or at least you've compiled for 32-bit).

  • 2021-01-02 02:47

    Extracted answer from the edited question:

    It was caused by building 32b binaries on Windows as opposed to 64b binaries on Linux. Here are the 64b numbers for Windows:

    Visual studio 2013 Debug 64b
        29.1985
    Visual studio 2013 Release 64b
        29.7469
    