Measuring memory bandwidth from the dot product of two arrays

灰色年华 2020-11-27 07:24

The dot product of two arrays

    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];

does not reuse data, so it should be a memory-bound operation.
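
For reference, a minimal sketch of how such a measurement can be taken (the array size, timing calls, and output format here are illustrative assumptions, not the original harness): the loop moves 2*n*sizeof(double) bytes with no writes, so bandwidth is just bytes moved divided by elapsed time.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        const int n = 1 << 26;                 /* ~512 MB per array: far larger than cache */
        double *x = malloc(n * sizeof *x);
        double *y = malloc(n * sizeof *y);
        for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        double sum = 0.0;
        for (int i = 0; i < n; i++) sum += x[i] * y[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        double gbs  = 2.0 * n * sizeof(double) / secs / 1e9;  /* two streaming reads, no writes */
        printf("sum = %f, %.2f GB/s\n", sum, gbs);
        free(x); free(y);
        return 0;
    }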

2 Answers
  • 2020-11-27 07:46

    There are a few things going on here, which come down to:

    • You have to work fairly hard to get every last bit of performance out of the memory subsystem; and
    • Different benchmarks measure different things.

    The first helps explain why you need multiple threads to saturate the available memory bandwidth. There is a lot of concurrency in the memory system, and taking advantage of it will often require some concurrency in your CPU code. One big reason that multiple threads of execution help is latency hiding - while one thread is stalled waiting for data to arrive, another thread may be able to take advantage of some other data that has just become available.
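
    As a concrete illustration of that point, here is a minimal sketch (assuming OpenMP; not the asker's exact code) of the dot product split across threads, so that several independent streams of cache misses are in flight at once:

    /* Compile with -fopenmp.  Each thread streams its own contiguous
       chunk of x and y, so the memory system sees several concurrent
       read streams instead of one. */
    double dot(const double *x, const double *y, int n) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += x[i] * y[i];
        return sum;
    }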

    The hardware helps you a lot on a single thread in this case - because the memory access is so predictable, the hardware can prefetch the data ahead of when you need it, giving you some of the advantage of latency hiding even with one thread; but there are limits to what prefetch can do. The prefetcher won't take it upon itself to cross page boundaries, for instance. The canonical reference for much of this is What Every Programmer Should Know About Memory by Ulrich Drepper, which is now old enough that some gaps are starting to show (Intel's Hot Chips overview of your Sandy Bridge processor is here - note in particular the tighter integration of the memory management hardware with the CPU).
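
    If you want to experiment with that limit yourself, GCC and Clang expose software prefetching through __builtin_prefetch; here is a sketch (the prefetch distance of 64 elements is an assumption you would have to tune for your machine):

    /* Software-prefetch variant of the dot product.  PF_DIST is a tuning
       knob: far enough ahead to cover memory latency, close enough that
       the lines aren't evicted before they are used. */
    #define PF_DIST 64

    double dot_prefetch(const double *x, const double *y, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            if (i + PF_DIST < n) {
                __builtin_prefetch(&x[i + PF_DIST], 0, 0);  /* read, non-temporal */
                __builtin_prefetch(&y[i + PF_DIST], 0, 0);
            }
            sum += x[i] * y[i];
        }
        return sum;
    }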

    As to the question of comparing with memset, mbw, or STREAM: comparing across benchmarks will always cause headaches, even between benchmarks that claim to measure the same thing. In particular, "memory bandwidth" isn't a single number - performance varies quite a bit depending on the operations. Both mbw and STREAM do some version of a copy operation, with STREAM's operations spelled out below (taken straight from its web page; all operands are double-precision floating-point values):

    ------------------------------------------------------------------
    name        kernel                  bytes/iter      FLOPS/iter
    ------------------------------------------------------------------
    COPY:       a(i) = b(i)                 16              0
    SCALE:      a(i) = q*b(i)               16              1
    SUM:        a(i) = b(i) + c(i)          24              1
    TRIAD:      a(i) = b(i) + q*c(i)        24              2
    ------------------------------------------------------------------
    

    so roughly 1/2 to 1/3 of the memory operations in these cases are writes (and everything is a write in the case of memset). While individual writes can be a little slower than reads, the bigger issue is that it's much harder to saturate the memory subsystem with writes, because of course you can't do the equivalent of prefetching a write. Interleaving the reads and writes helps, but your dot-product example, which is essentially all reads, is going to be about the best possible case for pegging the needle on memory bandwidth.
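
    In plain C, the four kernels from the table boil down to loops of this shape (my transcription of the table above; a, b, and c are double arrays of length n, and q is an arbitrary scalar):

    void copy(double *a, const double *b, int n) {
        for (int i = 0; i < n; i++) a[i] = b[i];
    }
    void scale(double *a, const double *b, double q, int n) {
        for (int i = 0; i < n; i++) a[i] = q * b[i];
    }
    void sum(double *a, const double *b, const double *c, int n) {
        for (int i = 0; i < n; i++) a[i] = b[i] + c[i];
    }
    void triad(double *a, const double *b, const double *c, double q, int n) {
        for (int i = 0; i < n; i++) a[i] = b[i] + q * c[i];
    }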

    In addition, the STREAM benchmark is (intentionally) written completely portably, with only some compiler pragmas to suggest vectorization, so beating the STREAM benchmark isn't necessarily a warning sign, especially when what you're doing is two streaming reads.

  • 2020-11-27 07:53

    I made my own memory benchmark code: https://github.com/zboson/bandwidth

    Here are the current results for eight threads:

    write:    0.5 GB, time 2.96e-01 s, 18.11 GB/s
    copy:       1 GB, time 4.50e-01 s, 23.85 GB/s
    scale:      1 GB, time 4.50e-01 s, 23.85 GB/s
    add:      1.5 GB, time 6.59e-01 s, 24.45 GB/s
    mul:      1.5 GB, time 6.56e-01 s, 24.57 GB/s
    triad:    1.5 GB, time 6.61e-01 s, 24.37 GB/s
    vsum:     0.5 GB, time 1.49e-01 s, 36.09 GB/s, sum -8.986818e+03
    vmul:     0.5 GB, time 9.00e-05 s, 59635.10 GB/s, sum 0.000000e+00
    vmul_sum:   1 GB, time 3.25e-01 s, 33.06 GB/s, sum 1.910421e+04
    

    Here are the current results for one thread:

    write:    0.5 GB, time 4.65e-01 s, 11.54 GB/s
    copy:       1 GB, time 7.51e-01 s, 14.30 GB/s
    scale:      1 GB, time 7.45e-01 s, 14.41 GB/s
    add:      1.5 GB, time 1.02e+00 s, 15.80 GB/s
    mul:      1.5 GB, time 1.07e+00 s, 15.08 GB/s
    triad:    1.5 GB, time 1.02e+00 s, 15.76 GB/s
    vsum:     0.5 GB, time 2.78e-01 s, 19.29 GB/s, sum -8.990941e+03
    vmul:     0.5 GB, time 1.15e-05 s, 468719.08 GB/s, sum 0.000000e+00
    vmul_sum:   1 GB, time 5.72e-01 s, 18.78 GB/s, sum 1.910549e+04
    
    1. write: writes a constant (3.14159) to an array. This should be like memset.
    2. copy, scale, add, and triad are defined the same as in STREAM
    3. mul: a(i) = b(i) * c(i)
    4. vsum: sum += a(i)
    5. vmul: sum *= a(i)
    6. vmul_sum: sum += a(i)*b(i) // the dot product (sketched below)
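
    For concreteness, the reduction kernels (items 4-6 above) are loops of this shape (a sketch; the real, threaded and timed versions are in the repository linked above):

    double vsum(const double *a, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += a[i];
        return s;
    }
    double vmul(const double *a, int n) {       /* misbehaves once the product hits zero */
        double s = 1.0;
        for (int i = 0; i < n; i++) s *= a[i];
        return s;
    }
    double vmul_sum(const double *a, const double *b, int n) {   /* the dot product */
        double s = 0.0;
        for (int i = 0; i < n; i++) s += a[i] * b[i];
        return s;
    }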

    My results are consistent with STREAM, and I get the highest bandwidth for vsum. The vmul test does not work correctly at the moment: once the running product reaches zero, it finishes early. I can get slightly better results (by about 10%) using intrinsics and unrolling the loop, which I will add later.
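
    For what it's worth, here is a sketch of what an intrinsics-plus-unrolling version could look like (assuming AVX, which Sandy Bridge supports; this is an illustration, not the code that will land in the repository):

    #include <immintrin.h>

    /* Dot product with AVX intrinsics, unrolled by two: the independent
       accumulators keep the vaddpd latency off the critical path.
       Assumes n is a multiple of 8; compile with -mavx. */
    double dot_avx(const double *x, const double *y, int n) {
        __m256d acc0 = _mm256_setzero_pd();
        __m256d acc1 = _mm256_setzero_pd();
        for (int i = 0; i < n; i += 8) {
            acc0 = _mm256_add_pd(acc0, _mm256_mul_pd(_mm256_loadu_pd(&x[i]),
                                                     _mm256_loadu_pd(&y[i])));
            acc1 = _mm256_add_pd(acc1, _mm256_mul_pd(_mm256_loadu_pd(&x[i + 4]),
                                                     _mm256_loadu_pd(&y[i + 4])));
        }
        double tmp[4];
        _mm256_storeu_pd(tmp, _mm256_add_pd(acc0, acc1));
        return (tmp[0] + tmp[1]) + (tmp[2] + tmp[3]);
    }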
