At what point in the loop does integer overflow become undefined behavior?

南方客 2020-12-07 18:12

This is an example to illustrate my question, which involves some much more complicated code that I can't post here.

#include <iostream>

int main()
{
    int i = 0;
    int a = 0;
    while (i < 10)  // i is never modified, so the loop never terminates
    {
        std::cout << "Hello" << std::endl;
        a = a + 1000000000;  // exceeds INT_MAX (undefined behavior) on the third addition
    }
}
12 answers
  • 2020-12-07 18:38

    TartanLlama's answer is correct. The undefined behavior can happen at any time, even during compile time. This may seem absurd, but it's a key feature that permits compilers to do what they need to do. It's not always easy to be a compiler. You have to do exactly what the spec says, every time. However, sometimes it can be monstrously difficult to prove that a particular behavior is occurring. If you remember the halting problem, it's rather trivial to write software for which you cannot prove whether it completes or enters an infinite loop when fed a particular input.

    We could make compilers pessimistic, constantly compiling in fear that the next instruction might be one of these halting-problem-like issues, but that isn't reasonable. Instead we give the compiler a pass: on these "undefined behavior" topics, it is freed from any responsibility. Undefined behavior consists of all the behaviors which are so subtly nefarious that we have trouble separating them from the really-nasty-nefarious halting problems and whatnot.

    There is an example which I love to post, though I admit I've lost the source, so I have to paraphrase. It was from a particular version of MySQL. In MySQL, they had a circular buffer which was filled with user-provided data. They, of course, wanted to make sure that the data didn't overflow the buffer, so they had a check:

    if (currentPtr + numberOfNewChars > endOfBufferPtr) { doOverflowLogic(); }
    

    It looks sane enough. However, what if numberOfNewChars is really big and the addition overflows? Then the sum wraps around and becomes a pointer smaller than endOfBufferPtr, so the overflow logic would never get called. So they added a second check, before that one:

    if (currentPtr + numberOfNewChars < currentPtr) { detectWrapAround(); }
    

    It looks like you took care of the buffer overflow error, right? However, a bug was submitted stating that this buffer overflowed on a particular version of Debian! Careful investigation showed that this version of Debian was the first to use a particularly bleeding-edge version of gcc. On this version of gcc, the compiler recognized that currentPtr + numberOfNewChars can never be a smaller pointer than currentPtr because overflow for pointers is undefined behavior! That was sufficient for gcc to optimize out the entire check, and suddenly you were not protected against buffer overflows even though you wrote the code to check it!

    This was spec behavior. Everything was legal (though from what I heard, gcc rolled back this change in the next version). It's not what I would consider intuitive behavior, but if you stretch your imagination a bit, it's easy to see how a slight variant of this situation could become a halting problem for the compiler. Because of this, the spec writers made it "Undefined Behavior" and stated that the compiler could do absolutely anything it pleased.
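
    The standard-blessed way to write such a guard is to compare remaining space rather than forming a pointer that might land out of bounds. A minimal sketch, reusing the identifier names from the paraphrased example above:

    // Pointer subtraction is well-defined while both pointers point into
    // (or one past the end of) the same buffer, so no wraparound can occur here.
    if (numberOfNewChars > static_cast<size_t>(endOfBufferPtr - currentPtr)) {
        doOverflowLogic();
    }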

  • 2020-12-07 18:40

    Undefined behavior is, by definition, a grey area. You simply can't predict what it will or won't do -- that's what "undefined behavior" means.

    Since time immemorial, programmers have always tried to salvage remnants of definedness from an undefined situation. They've got some code they really want to use, but which turns out to be undefined, so they try to argue: "I know it's undefined, but surely it will, at worst, do this or this; it will never do that." And sometimes these arguments are more or less right -- but often, they're wrong. And as the compilers get smarter and smarter (or, some people might say, sneakier and sneakier), the boundaries of the question keep changing.

    So really, if you want to write code that's guaranteed to work, and that will keep working for a long time, there's only one choice: avoid ye the undefined behavior at all costs. Verily, if you dabble in it, it will come back to haunt you.

  • 2020-12-07 18:41

    First, let me correct the title of this question:

    Undefined Behavior is not (specifically) confined to the realm of execution.

    Undefined Behavior affects all steps: compiling, linking, loading and executing.

    Some examples to cement this (bear in mind that no list here is exhaustive):

    • the compiler can assume that portions of code that contain Undefined Behavior are never executed, and thus treat the execution paths that would lead to them as dead code (a minimal sketch follows this list). See What every C programmer should know about undefined behavior by none other than Chris Lattner.
    • the linker can assume that in the presence of multiple definitions of a weak symbol (recognized by name), all definitions are identical thanks to the One Definition Rule
    • the loader (in case you use dynamic libraries) can assume the same, thus picking the first symbol it finds; this is usually (ab)used for intercepting calls using LD_PRELOAD tricks on Unixes
    • the execution might fail (SIGSEGV) should you use dangling pointers
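
    To make the first point concrete, here is a minimal sketch of the classic null-check elision; the function and variable names are mine, not from any particular codebase:

    int read_flag(int* p)
    {
        int v = *p;         // dereferencing p is UB if p is null, so the
                            // compiler may assume p != nullptr from here on...
        if (p == nullptr)   // ...which makes this branch provably dead code
            return 0;       // that the optimizer is free to delete
        return v;
    }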

    This is what is so scary about Undefined Behavior: it is nigh impossible to predict, ahead of time, what exact behavior will occur, and this prediction has to be revisited at each update of the toolchain, underlying OS, ...


    I recommend watching this video by Michael Spencer (LLVM Developer): CppCon 2016: My Little Optimizer: Undefined Behavior is Magic.

  • 2020-12-07 18:41

    The top answer rests on a wrong (but common) misconception:

    Undefined behavior is a run-time property*. It CANNOT "time-travel"!

    Certain operations are defined (by the standard) to have side-effects and cannot be optimized away. Operations that do I/O or that access volatile variables fall in this category.
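
    For instance, a volatile access counts as observable behavior, so the compiler must actually perform it. A minimal sketch (the register name is illustrative):

    volatile int status_reg;   // e.g. a memory-mapped hardware register

    int poll()
    {
        return status_reg;     // a volatile read is observable behavior:
                               // it must be performed, not cached or dropped
    }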

    However, there is a caveat: UB can be any behavior, including behavior that undoes previous operations. This can have similar consequences, in some cases, to optimizing out earlier code.

    In fact, this is consistent with the quote in the top answer (emphasis mine):

    A conforming implementation executing a well-formed program shall produce the same observable behavior as one of the possible executions of the corresponding instance of the abstract machine with the same program and the same input.
    However, if any such execution contains an undefined operation, this International Standard places no requirement on the implementation executing that program with that input (not even with regard to operations preceding the first undefined operation).

    Yes, this quote does say "not even with regard to operations preceding the first undefined operation", but notice that this is specifically about code that is being executed, not merely compiled.
    After all, undefined behavior that isn't actually reached doesn't do anything, and for the line containing UB to be actually reached, code that precedes it must execute first!

    So yes, once UB is executed, any effects of previous operations become undefined. But until that happens, the execution of the program is well-defined.

    Note, however, that all executions of the program that reach the UB can be optimized into equivalent programs, including ones that perform the previous operations but then un-do their effects. Consequently, preceding code may be optimized away whenever doing so would be equivalent to its effects being undone; otherwise, it can't. See below for an example.

    *Note: This is not inconsistent with UB occurring at compile time. If the compiler can indeed prove that UB code will always be executed for all inputs, then UB can extend to compile time. However, this requires knowing that all previous code eventually returns, which is a strong requirement. Again, see below for an example/explanation.


    To make this concrete, note that the following code must print foo and wait for your input regardless of any undefined behavior that follows it:

    printf("foo");
    getchar();
    *(char*)1 = 1;
    

    However, also note that there is no guarantee that foo will remain on the screen after the UB occurs, or that the character you typed will no longer be in the input buffer; both of these operations can be "undone", which has a similar effect to UB "time-travel".

    If the getchar() line weren't there, it would be legal for the lines to be optimized away if and only if that would be indistinguishable from outputting foo and then "un-doing" it.

    Whether or not the two would be indistinguishable would depend entirely on the implementation (i.e. on your compiler and standard library). For example, can your printf block your thread here while waiting for another program to read the output? Or will it return immediately?

    • If it can block here, then another program can refuse to read its full output, and it may never return, and consequently UB may never actually occur.

    • If it can return immediately here, then we know it must return, and therefore optimizing it out is entirely indistinguishable from executing it and then un-doing its effects.

    Of course, since the compiler knows what behavior is permissible for its particular version of printf, it can optimize accordingly, and consequently printf may get optimized out in some cases and not others. But, again, the justification is that this would be indistinguishable from the UB un-doing previous operations, not that the previous code is "poisoned" because of UB.

  • 2020-12-07 18:45

    An aggressively optimising C or C++ compiler targeting a 16-bit int will know that the behaviour of adding 1000000000 to an int is undefined.

    Either standard permits it to do anything it wants, which could include deleting the entire program and leaving int main(){}.

    But what about larger ints? I don't know of a compiler that does this yet (and I'm not an expert in C and C++ compiler design by any means), but I imagine that someday a compiler targeting a 32-bit int or larger will figure out that the loop is infinite (i doesn't change) and so a will eventually overflow. So once again, it can optimise the output to int main(){}. The point I'm trying to make here is that as compiler optimisations become progressively more aggressive, more and more undefined-behaviour constructs are manifesting themselves in unexpected ways.

    The fact that your loop is infinite is not in itself undefined since you are writing to standard output in the loop body.
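
    If you need the addition without risking the UB, the usual remedy is to check before adding (or to do the arithmetic in unsigned, where wraparound is well-defined). A minimal sketch, not taken from the original question:

    #include <climits>

    // Refuses to add rather than overflow, since signed overflow is UB.
    bool safe_add(int a, int b, int& out)
    {
        if ((b > 0 && a > INT_MAX - b) ||
            (b < 0 && a < INT_MIN - b))
            return false;    // the sum would overflow
        out = a + b;
        return true;
    }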

  • 2020-12-07 18:45

    One thing your example doesn't consider is optimisation. a is set in the loop but never used, and an optimiser could work this out. As such, it is legitimate for the optimiser to discard a completely, and in that case all undefined behaviour vanishes like a boojum's victim.
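
    A sketch of what that could look like, assuming the loop has the shape described in the question:

    // Before optimisation (assumed shape of the question's loop):
    while (i < 10)                          // i never changes
    {
        std::cout << "Hello" << std::endl;  // observable: must stay
        a = a + 1000000000;                 // dead store: a is never read
    }

    // A conforming optimiser may delete the dead stores to a -- and with
    // them the only overflowing operation -- leaving just:
    while (i < 10)
        std::cout << "Hello" << std::endl;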

    However of course this itself is undefined, because optimisation is undefined. :)
