How can I resolve data dependency in pointer arrays?

Question

If we have an array of integer pointers that all point to the same int and loop over it applying ++, the loop is 100% slower (it takes twice as long) than when the pointers point to two different ints. Here is a concrete example:

int* data[2];
int a, b;
a = b = 0;
for (auto i = 0ul; i < 2; ++i) {
    // Case 3: 2.5 sec
    data[i] = &a;

    // Case 2: 1.25 sec
    // if (i & 1)
    //     data[i] = &a;
    // else
    //     data[i] = &b;
}

for (auto i = 0ul; i < 1000000000; ++i) {
    // Case 1: 0.5sec
    // asm volatile("" : "+g"(i)); // deoptimize
    // ++*data[0];

    ++*data[i & 1];
}

In summary, the observations are (describing the loop body):

case 1 (fast): ++*pointer[0]

case 2 (medium): ++*pointer[i] with half of the pointers pointing to one int and the other half pointing to another int.

case 3 (slow): ++*pointer[i] with all pointers pointing to the same int

Here are my current thoughts. Case 1 is fast because a modern CPU knows we are reading and writing the same memory location and buffers the operation, while in Case 2 and Case 3 we need to write the result out in each iteration. Case 3 is slower than Case 2 because when we write to a memory location through pointer a and then try to read it through pointer b, we have to wait for the write to finish. This stops superscalar execution.

Do I understand it correctly? Is there any way to make Case 3 faster without changing the pointer array? (perhaps adding some CPU hints?)

This question was extracted from a real problem: https://github.com/ClickHouse/ClickHouse/pull/7550

Answer 1:

You've discovered one of the effects that causes bottlenecks in histograms. A workaround for that problem is to keep multiple arrays of counters and rotate through them, so repeated runs of the same index are distributed over 2 or 4 different counters in memory.

(Then loop over the arrays of counters to sum them down into one final set of counts. This part can benefit from SIMD.)
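A minimal sketch of the multiple-counter-array idea, assuming byte-valued input and 4 sub-histograms. The function name `histogram4` and the lane count are made up for illustration and are not taken from the ClickHouse PR:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: histogram of byte values using several sub-histograms.
std::array<std::uint32_t, 256> histogram4(const std::uint8_t* data, std::size_t n) {
    constexpr std::size_t kLanes = 4;  // assumed lane count (2 or 4 are typical)

    // counts[lane][value], zero-initialized.
    std::array<std::array<std::uint32_t, 256>, kLanes> counts{};

    // Rotate through the lanes so that a run of equal values is spread
    // over kLanes different memory locations instead of one.
    for (std::size_t i = 0; i < n; ++i) {
        ++counts[i % kLanes][data[i]];
    }

    // Sum the lanes down into one final set of counts.
    std::array<std::uint32_t, 256> total{};
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        for (std::size_t v = 0; v < 256; ++v) {
            total[v] += counts[lane][v];
        }
    }
    return total;
}
```

The final summing loop has no dependency between different values of `v`, which is why it is a candidate for auto-vectorization.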

"Case 1 is fast because a modern CPU knows we are reading and writing the same memory location and buffers the operation"

No, it's not the CPU, it's a compile-time optimization.

++*pointer[0] is fast because the compiler can hoist the store/reload out of the loop and actually just increment a register. (If you don't use the result, it might optimize away even that.)

Assumption of no data-race UB lets the compiler assume that nothing else is modifying pointer[0] so it's definitely the same object being incremented every time. And the as-if rule lets it keep *pointer[0] in a register instead of actually doing a memory-destination increment.

So that means 1 cycle latency for the increment, and of course it can combine multiple increments into one and do *pointer[0] += n if it fully unrolls and optimizes away the loop.
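To illustrate, the two functions below produce the same result, and the second is roughly the shape an optimizing compiler is allowed to turn the first into under the as-if rule. The function names are invented for this sketch, and what a given compiler actually emits depends on its version and flags:

```cpp
#include <cstddef>

// As written in the source: a memory-destination increment per iteration.
void increment_as_written(int** pointer, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        ++*pointer[0];
    }
}

// Roughly what the optimizer may do: load once, increment a register,
// store once. It could go further and replace the loop with a single add.
void increment_hoisted(int** pointer, std::size_t n) {
    if (n == 0) return;      // the original touches no memory when n == 0
    int* p = pointer[0];     // same object on every iteration
    int tmp = *p;            // load hoisted out of the loop
    for (std::size_t i = 0; i < n; ++i) {
        ++tmp;               // 1-cycle register increment
    }
    *p = tmp;                // store sunk out of the loop
}
```

Comparing the generated asm for the Case 1 loop with and without the `asm volatile` line from the question is one way to check whether this transformation happened.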

"When we write to a memory location through pointer a and then try to read it through pointer b, we have to wait for the write to finish. This stops superscalar execution."

Yes, the data dependency through that memory location is the problem. Without knowing at compile time that the pointers all point to the same place, the compiler will make asm that does actually increment the pointed-to memory location.

"wait for the write to finish" isn't strictly accurate, though. The CPU has a store buffer to decouple store execution from cache misses, and out-of-order speculative exec from stores actually committing to L1d and being visible to other cores. A reload of recently-stored data doesn't have to wait for it to commit to cache; store forwarding from the store-buffer to a reload is a thing once the CPU detects it.

On modern Intel CPUs, store-forwarding latency is about 5 cycles, so a memory-destination add has 6-cycle latency. (1 for the add, 5 for the store/reload if it's on the critical path.)

And yes, out-of-order execution lets two of these 6-cycle-latency dependency chains run in parallel. And the loop overhead is hidden under that latency, again by OoO exec.
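As a back-of-the-envelope check (a simplified model assumed for this sketch, not a measurement): if every increment sits on a 6-cycle store-forward-plus-add chain and the increments are spread evenly over k independent memory locations, the loop is latency-bound at about 6/k cycles per iteration. That gives 6 for Case 3 and 3 for Case 2, which matches the 2:1 ratio between the measured 2.5 s and 1.25 s.

```cpp
// Latency-bound estimate of cycles per loop iteration when the increments
// are split evenly across `chains` independent dependency chains.
// Hypothetical helper; it ignores throughput limits such as store bandwidth.
constexpr double cycles_per_iteration(double chain_latency_cycles, int chains) {
    return chain_latency_cycles / chains;
}
```

The model stops being useful once the number of chains is large enough that something other than latency, such as front-end or store throughput, becomes the bottleneck.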

Store-to-Load Forwarding and Memory Disambiguation in x86 Processors on stuffedcow.net
Store forwarding Address vs Data: What the difference between STD and STA in the Intel Optimization guide?
How does store to load forwarding happens in case of unaligned memory access?
Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?
Why is execution time of a process shorter when another process shares the same HT core (On Sandybridge-family, store-forwarding latency can be reduced if you don't try to reload right away.)

"Is there any way to make Case 3 faster without changing the pointer array?"

Yes, if that case is expected, maybe branch on it:

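    int *current_pointer = pointer[0];
    int repeats = 1;               // pointer[0] itself counts as the first repeat
    ...

    loop {                         // over the remaining i, starting at 1
        if (pointer[i] == current_pointer) {
            repeats++;             // same target: just extend the current run
        } else {
            *current_pointer += repeats;   // flush the finished run with one add
            current_pointer = pointer[i];
            repeats = 1;
        }
    }
    *current_pointer += repeats;   // flush the last run after the loop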

This optimizes by counting the run length of consecutive repeats of the same pointer, so each run costs one memory-destination add instead of one per element.

This is totally defeated by Case 2 and will perform poorly if long runs are not common.

Short runs can be hidden by out-of-order exec; only when the dep chain becomes long enough to fill the ROB (reorder buffer) do we actually stall.
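For reference, here is the run-length idea wrapped in a self-contained function. This is a sketch under assumptions: the name `increment_all_run_length` and the `(pointer, n)` interface are hypothetical, and it treats the input as a plain array with one increment per entry rather than the `data[i & 1]` indexing from the question.

```cpp
#include <cstddef>

// Hypothetical wrapper: applies one increment to *pointer[i] for every
// i in [0, n), batching consecutive entries that share a target.
void increment_all_run_length(int* const* pointer, std::size_t n) {
    if (n == 0) return;                  // nothing to do, avoid reading pointer[0]

    int* current_pointer = pointer[0];
    int repeats = 1;                     // pointer[0] is already counted

    for (std::size_t i = 1; i < n; ++i) {
        if (pointer[i] == current_pointer) {
            ++repeats;                   // register-only work inside a run
        } else {
            *current_pointer += repeats; // one read-modify-write per run
            current_pointer = pointer[i];
            repeats = 1;
        }
    }
    *current_pointer += repeats;         // flush the final run
}
```

Within a run, the loop-carried dependency is a register increment rather than a store/reload, which is where the gain in Case 3 would come from. With alternating pointers, as in Case 2, every iteration takes the else branch and the function does strictly more work than the plain loop.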

Source: https://stackoverflow.com/questions/58648563/how-can-i-resolve-data-dependency-in-pointer-arrays

Tags

c++

performance

compiler-optimization

micro-optimization