The C Standard explicitly specifies signed integer overflow as having undefined behavior. Yet most CPUs implement signed arithmetic with defined semantics for overflow (except maybe for division overflow: x / 0 and INT_MIN / -1).
Compiler writers have been taking advantage of the undefinedness of such overflows to add more aggressive optimisations that tend to break legacy code in very subtle ways. For example, this code may have worked on older compilers but no longer does on current versions of gcc and clang:
/* Increment a by a value in 0..255, clamp a to positive integers.
   The code relies on 32-bit wrap-around, but the C Standard makes
   signed integer overflow undefined behavior, so sum_max can now
   return values less than a. There are Standard-compliant ways to
   implement this, but legacy code is what it is... */
int sum_max(int a, unsigned char b) {
    int res = a + b;
    return (res >= a) ? res : INT_MAX;
}
Is there hard evidence that these optimisations are worthwhile? Are there comparative studies documenting the actual improvements on real life examples or even on classical benchmarks?
I came up with this question as I was watching this: C++Now 2018: John Regehr “Closing Keynote: Undefined Behavior and Compiler Optimizations”
I am tagging c and c++ as the problem is similar in both languages but the answers might be different.
I don't know about studies and statistics, but yes, there are definitely optimizations taking this into account that compilers actually perform. And yes, they are very important (tl;dr: loop vectorization, for example).
Besides the compiler optimizations, there is another aspect to be taken into account. With the overflow being UB, C/C++ signed integers behave arithmetically as you would expect mathematically. For instance x + 10 > x holds true now (for valid code of course), but would not under wrap-around behavior.
I've found an excellent article, How undefined signed overflow enables optimizations in GCC from Krister Walfridsson’s blog, listing some optimizations that take signed overflow UB into account. The following examples are from it. I am adding C++ and assembly examples to them.
If the optimizations look too simple, uninteresting or unimpactful, remember that these optimizations are just steps in a much, much larger chain of optimizations. And the butterfly effect does happen: a seemingly unimportant optimization at an earlier step can trigger a much more impactful optimization at a later step.
If the examples look nonsensical (who would write x * 10 > 0?) keep in mind that you can very easily get to this kind of example in C and C++ with constants, macros and templates. Besides, the compiler can arrive at this kind of example when applying transformations and optimizations in its IR.
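As a hedged illustration (my own example, not from the article) of how such expressions appear without anyone literally typing them: once the template below is instantiated with Scale = 10 and Offset = 0, the comparison is exactly x * 10 > 0, and the compiler is free to drop the multiplication because signed overflow is undefined.
template <int Scale, int Offset>
bool above_threshold(int x) {
    // After instantiation this is "x * Scale > Offset" with the constants baked in.
    return x * Scale > Offset;
}

bool check(int x) {
    return above_threshold<10, 0>(x);   // effectively "x * 10 > 0", foldable to "x > 0"
}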
Signed integer expression simplification
Eliminate multiplication in comparison with 0
(x * c) cmp 0 -> x cmp 0
bool foo(int x) { return x * 10 > 0; }
foo(int):
test edi, edi
setg al
ret
Eliminate division after multiplication
(x * c1) / c2 -> x * (c1 / c2) if c1 is divisible by c2
int foo(int x) { return (x * 20) / 10; }
foo(int):
lea eax, [rdi+rdi]
ret
Eliminate negation
(-x) / (-y) -> x / y
int foo(int x, int y) { return (-x) / (-y); }
foo(int, int):
mov eax, edi
cdq
idiv esi
ret
Simplify comparisons that are always true or false
x + c < x   -> false
x + c <= x  -> false
x + c > x   -> true
x + c >= x  -> true
bool foo(int x) { return x + 10 >= x; }
foo(int):
mov eax, 1
ret
Eliminate negation in comparisons
(-x) cmp (-y) -> y cmp x
bool foo(int x, int y) { return -x < -y; }
foo(int, int):
cmp edi, esi
setg al
ret
Reduce magnitude of constants
x + c > y   -> x + (c - 1) >= y
x + c <= y  -> x + (c - 1) < y
bool foo(int x, int y) { return x + 10 <= y; }
foo(int, int):
add edi, 9
cmp edi, esi
setl al
ret
Eliminate constants in comparisons
(x + c1) cmp c2        -> x cmp (c2 - c1)
(x + c1) cmp (y + c2)  -> x cmp (y + (c2 - c1)) if c1 <= c2
The second transformation is only valid if c1 <= c2, as it would otherwise introduce an overflow when y has the value INT_MIN.
bool foo(int x) { return x + 42 <= 11; }
foo(int):
cmp edi, -30
setl al
ret
Pointer arithmetic and type promotion
If an operation does not overflow, then we will get the same result if we do the operation in a wider type. This is often useful when doing things like array indexing on 64-bit architectures — the index calculations are typically done using 32-bit int, but the pointers are 64-bit, and the compiler may generate more efficient code when signed overflow is undefined by promoting the 32-bit integers to 64-bit operations instead of generating type extensions.
One other aspect of this is that undefined overflow ensures that a[i] and a[i+1] are adjacent. This improves analysis of memory accesses for vectorization etc.
This is a very important optimization, as loop vectorization is one of the most efficient and effective optimization algorithms.
It is trickier to demonstrate. But I remember actually encountering a situation where changing an index from unsigned to signed drastically improved the generated assembly. Unfortunately I cannot remember or replicate it now. I will come back later if I figure it out.
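In the meantime, here is a minimal sketch (a hypothetical example, not the case mentioned above) of the kind of code where the widening matters: a 32-bit signed index into an array addressed through a 64-bit pointer. Because signed overflow is undefined, the compiler may promote i to a 64-bit induction variable and treat a[i] and a[i+1] as adjacent, instead of re-extending the index on every iteration.
long sum(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)   // signed 'int i': no wrap-around to account for
        s += a[i];
    return s;
}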
Value range calculations
The compiler keeps track of the variables' range of possible values at each point in the program, i.e. for code such as
int x = foo();
if (x > 0) {
    int y = x + 5;
    int z = y / 4;
it determines that x has the range [1, INT_MAX] after the if-statement, and can thus determine that y has the range [6, INT_MAX] as overflow is not allowed. And the next line can be optimized to int z = y >> 2; as the compiler knows that y is non-negative.
auto foo(int x)
{
    if (x <= 0)
        __builtin_unreachable();
    return (x + 5) / 4;
}
foo(int):
lea eax, [rdi+5]
sar eax, 2
ret
The undefined overflow helps optimizations that need to compare two values (as the wrapping case would give possible values of the form [INT_MIN, (INT_MIN+4)] or [6, INT_MAX], which prevent all useful comparisons with < or >), such as:
- Changing comparisons x < y to true or false if the ranges for x and y do not overlap
- Changing min(x,y) or max(x,y) to x or y if the ranges do not overlap
- Changing abs(x) to x or -x if the range does not cross 0
- Changing x / c to x >> log2(c) if x > 0 and the constant c is a power of 2
- Changing x % c to x & (c-1) if x > 0 and the constant c is a power of 2
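A small sketch (my own, reusing the __builtin_unreachable() trick from the earlier snippet) of the last three bullets: once the compiler knows x is strictly positive, it can fold abs(x) to x and x % 8 to x & 7.
#include <cstdlib>

int fold_abs_and_mod(int x) {
    if (x <= 0)
        __builtin_unreachable();   // tells the compiler x is in [1, INT_MAX]
    return std::abs(x) + (x % 8);  // abs(x) can fold to x, "% 8" to "& 7"
}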
Loop analysis and optimization
The canonical example of why undefined signed overflow helps loop optimizations is that loops like
for (int i = 0; i <= m; i++)
are guaranteed to terminate when overflow is undefined. This helps architectures that have specific loop instructions, as they in general do not handle infinite loops.
But undefined signed overflow helps many more loop optimizations. All analyses, such as determining the number of iterations, transforming induction variables, and keeping track of memory accesses, use everything in the previous sections in order to do their work. In particular, the set of loops that can be vectorized is severely reduced when signed overflow is allowed to wrap.
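As a hedged sketch (hypothetical example) of the canonical loop above: with i <= m and a signed i, the compiler may assume the loop cannot run forever and that its trip count is computable, which is what makes it a candidate for unrolling and vectorization; with a wrapping index and m == INT_MAX it would be an infinite loop.
void scale(float *a, int m) {
    for (int i = 0; i <= m; i++)   // 'i' cannot wrap, so the loop must terminate
        a[i] *= 2.0f;
}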
Not quite an example of optimization, but one useful consequence of undefined behaviour is the -ftrapv command-line switch of GCC/Clang. It inserts code which crashes your program on integer overflow.
It won't work on unsigned integers, in accordance with the idea that unsigned overflow is intentional.
The Standard's wording on signed integer overflow ensures that people won't write overflowing code on purpose, so -ftrapv is a useful tool to discover unintentional overflow.
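A minimal sketch (my own example) of -ftrapv in action: the addition below overflows a signed int at run time, and when the program is built with g++ -ftrapv or clang++ -ftrapv it aborts instead of silently producing a wrapped value.
#include <climits>
#include <cstdio>

int main(int argc, char **argv) {
    (void)argv;
    int x = INT_MAX - argc + 1;   // equals INT_MAX for a run with no arguments
    int y = x + argc;             // signed overflow: aborts under -ftrapv
    std::printf("%d\n", y);
    return 0;
}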
Here's an actual little benchmark, bubble sort. I've compared timings without/with -fwrapv (which means the overflow is UB / not UB). Here are the results (seconds):
                     -O3    -O3 -fwrapv    -O1    -O1 -fwrapv
Machine1, clang      5.2    6.3            6.8    7.7
Machine2, clang-8    4.2    7.8            6.4    6.7
Machine2, gcc-8      6.6    7.4            6.5    6.5
As you can see, the not-UB (-fwrapv) version is almost always slower; the largest difference is pretty big, at 1.85x.
Here's the code. Note that I intentionally chose an implementation which should produce a larger difference for this test.
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h> /* for 'bool' when built as C; harmless in C++ */

void bubbleSort(int *a, long n) {
    bool swapped;
    for (int i = 0; i < n-1; i++) {
        swapped = false;
        for (int j = 0; j < n-i-1; j++) {
            if (a[j] > a[j+1]) {
                int t = a[j];
                a[j] = a[j+1];
                a[j+1] = t;
                swapped = true;
            }
        }
        if (!swapped) break;
    }
}

int main() {
    int a[8192];
    for (int j = 0; j < 100; j++) {
        for (int i = 0; i < 8192; i++) {
            a[i] = rand();
        }
        bubbleSort(a, 8192);
    }
}
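For reference, a hedged sketch of how the timings could be reproduced in-process (assuming the bubbleSort above; this replaces the main above and is not part of the original benchmark): measure wall-clock time with std::chrono instead of an external timer.
#include <chrono>
#include <cstdio>
#include <cstdlib>

int main() {
    static int a[8192];
    auto start = std::chrono::steady_clock::now();
    for (int j = 0; j < 100; j++) {
        for (int i = 0; i < 8192; i++)
            a[i] = rand();
        bubbleSort(a, 8192);   // same sort as above, same workload
    }
    auto stop = std::chrono::steady_clock::now();
    std::printf("%.2f s\n", std::chrono::duration<double>(stop - start).count());
}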
The answer is actually in your question:
Yet most CPUs implement signed arithmetic with defined semantics
I can't think of a CPU that you can buy today that does not use two's-complement arithmetic for signed integers, but that wasn't always the case.
The C language was invented in 1972. Back then, IBM 7090 mainframes still existed. Not all computers were two's complement.
To have defined the language (and overflow behaviour) around two's complement would have been prejudicial to code generation on machines that weren't.
Furthermore, as it has already been said, specifying that signed overflow is to be UB allows the compiler to produce better code, because it can discount code paths that result from signed overflow, assuming that this will never happen.
If I understand correctly that it's intended to clamp the sum of a and b to 0..INT_MAX without wraparound, I can think of two ways to write this function in a compliant way.
First, the inefficient general case that will work on all CPUs:
#include <limits>

int sum_max(int a, unsigned char b) {
    if (a > std::numeric_limits<int>::max() - b)
        return std::numeric_limits<int>::max();
    else
        return a + b;
}
Second, the surprisingly efficient two's-complement-specific way:
#include <cstring>
#include <limits>

int sum_max2(int a, unsigned char b) {
    unsigned int buffer;
    std::memcpy(&buffer, &a, sizeof(a));
    buffer += b;
    if (buffer > std::numeric_limits<int>::max())
        buffer = std::numeric_limits<int>::max();
    std::memcpy(&a, &buffer, sizeof(a));
    return a;
}
Resulting assembler can be seen here: https://godbolt.org/z/F42IXV
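A short usage check (my own, with made-up values), assuming both functions above are in scope: for non-negative inputs both clamp at INT_MAX once the true sum would exceed it.
#include <climits>
#include <cstdio>

int main() {
    std::printf("%d\n", sum_max(100, 200));           // 300
    std::printf("%d\n", sum_max(INT_MAX - 50, 200));  // INT_MAX
    std::printf("%d\n", sum_max2(INT_MAX - 50, 200)); // INT_MAX
    return 0;
}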
Source: https://stackoverflow.com/questions/56047702/is-there-some-meaningful-statistical-data-to-justify-keeping-signed-integer-arit