Why can a null character be embedded in a conversion specifier for scanf?

Question

Perhaps I'm misinterpreting my results, but:

#include <stdio.h>

int
main(void)
{
    char buf[32] = "";
    int x;
    x = scanf("%31[^\0]", buf);
    printf("x = %d, buf=%s", x, buf);
}
$ printf 'foo\n\0bar' | ./a.out
x = 1, buf=foo

Since the string literal "%31[^\0]" contains an embedded null, it seems that it should be treated the same as "%31[^", and the compiler should complain that the [ is unmatched. Indeed, if you replace the string literal with "%31[^", clang gives:

warning: no closing ']' for '%[' in scanf format string [-Wformat]

Why does it work to embed a null character in the string literal passed to scanf?

-- EDIT --

The above is undefined behavior and merely happens to "work".

Answer 1:

First of all, Clang totally fails to output any meaningful diagnostics here, whereas GCC knows exactly what is happening - so yet again GCC 1 - 0 Clang.

And as for the format string - well, it doesn't work. The format argument to scanf is a string, and a string ends at the terminating null, i.e. the call you're making is effectively

scanf("%31[^", buf);

On my computer, compiling the program gives

% gcc scanf.c
scanf.c: In function ‘main’:
scanf.c:8:20: warning: no closing ‘]’ for ‘%[’ format [-Wformat=]
    8 |     x = scanf("%31[^\0]", buf);
      |                    ^
scanf.c:8:21: warning: embedded ‘\0’ in format [-Wformat-contains-nul]
    8 |     x = scanf("%31[^\0]", buf);
      |                     ^~

The scanset must have the closing right bracket ], otherwise the conversion specification is invalid. If a conversion specification is invalid, the behaviour is undefined.

And, on my computer running it,

% printf 'foo\n\0bar' | ./a.out
x = 0, buf=

Q.E.D.

Answer 2:

This is a rather strange situation. I think there are a couple of things going on.

First of all, a string in C ends by definition at the first \0. You can always scoff at this rule, for example by writing a string literal with an explicit \0 in the middle of it. When you do, though, the characters after the \0 are mostly invisible. Very few standard library functions are able to see them, because of course just about everything that interprets a C string will stop at the first \0 it finds.

However: the string you pass as the first argument to scanf is typically parsed twice -- and by "parsed" I mean actually interpreted as a scanf format string possibly containing special % sequences. It's always going to be parsed at run time, by the actual copy of scanf in your C run-time library. But it's typically also parsed by the compiler, at compile time, so that the compiler can warn you if the % sequences don't match the actual arguments you call it with. (The run-time library code for scanf, of course, is unable to perform this checking.)

Now, of course, there's a pretty significant potential problem here: what if the parsing performed by the compiler is in some way different than the parsing performed by the actual scanf code in the run-time library? That might lead to confusing results.

And, to my considerable surprise, it looks like the scanf format parsing code in compilers can (and in some cases does) do something special and unexpected. clang doesn't (it doesn't complain about the malformed string at all), but gcc says both "no closing ‘]’ for ‘%[’ format" and "embedded ‘\0’ in format". So it's noticing.

This is possible (though still surprising) because the compiler, at least, can see the whole string literal, and is in a position to notice that the null character is an explicit one inserted by the programmer, not the more usual implicit one appended by the compiler. And indeed the warning "embedded ‘\0’ in format" emitted by gcc proves that gcc, at least, is quite definitely written to accommodate this possibility. (See the footnote below for a bit more on the compiler's ability to "see" the whole string literal.)
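The difference between what the compiler sees and what run-time code sees can be sketched in C. This is only an illustration: HAS_EMBEDDED_NUL and visible_length are hypothetical names, not anything gcc or clang actually uses, and the macro is only meaningful for string literals or arrays, not pointers.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper macro (an assumption for illustration): sizeof sees
   the whole literal, including anything after an explicit \0, while strlen
   stops at the first \0. If the two disagree, the literal has an embedded
   null. Only valid for string literals or arrays, never for pointers. */
#define HAS_EMBEDDED_NUL(lit) (strlen(lit) + 1 != sizeof(lit))

/* What run-time code such as scanf can see of a format string: everything
   up to, but not including, the first \0. */
size_t visible_length(const char *fmt)
{
    return strlen(fmt);
}
```

For the format in the question, sizeof("%31[^\0]") is 8 but visible_length("%31[^\0]") is 5, so only "%31[^" is visible at run time.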

But the second question is, why does it (seem to) work at runtime? What is the actual scanf code in the C library doing?

That code, at least, has no way of knowing that the \0 was explicit and that there are "real" characters following it. That code simply must stop at the first \0 that it finds. So it's operating as if the format string was

"%31[^"

That's a malformed format string, of course. The run-time library code isn't required to do anything reasonable. But my copy, like yours, is able to read the full string "foo". What's up with that?

My guess is that after seeing the % and the [ and the ^, and deciding that it's going to scan characters not matching some set, it's perfectly willing to, in effect, infer the missing ], and sail on matching characters from the scanset, which ends up having no excluded characters.

I tested this by trying the variant

    x = scanf("%31[^\0o]", buf);

This also matched and printed "foo", not "f".

None of this is guaranteed, of course. @AnttiHaapala has already posted an answer showing that his C RTL declines to scan "foo" at all with the malformed format string.

Footnote: Most of the time, an embedded \0 in a string truly, prematurely ends it. Most of the time, everything following the \0 is effectively invisible, because at run time, every piece of string interpreting code will stop at the first \0 it finds, with no way to know whether it was one explicitly inserted by the programmer or implicitly appended by the compiler. But as we've seen, the compiler can tell the difference, because the compiler (obviously) can see the entire string literal, exactly as entered by the programmer. Here's proof:

char str1[] = "Hello, world!";
char str2[] = "Hello\0world!";

printf("sizeof(str1) = %zu, strlen(str1) = %zu\n", sizeof(str1), strlen(str1));
printf("sizeof(str2) = %zu, strlen(str2) = %zu\n", sizeof(str2), strlen(str2));

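Normally, sizeof on an array initialized from a string literal gives you a number one bigger than strlen, as it does for str1. For str2, though, strlen stops at the explicit \0 while sizeof still counts the whole array: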

sizeof(str1) = 14, strlen(str1) = 13
sizeof(str2) = 13, strlen(str2) = 5

Just for fun I also tried:

char str3[5] = "Hello";

This time, though, strlen gave a larger number:

sizeof(str3) = 5, strlen(str3) = 10

I was mildly lucky. str3 has no trailing \0, neither one inserted by me nor appended by the compiler, so strlen sails off the end, and could easily have counted hundreds or thousands of characters before finding a random \0 somewhere in memory, or crashing.
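A bounded search avoids running off the end of an array like str3. The following is a minimal sketch using the standard memchr; has_terminator and bounded_length are hypothetical helper names introduced here for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if a '\0' occurs within the first size
   bytes of arr, 0 otherwise. Never reads past arr[size - 1]. */
int has_terminator(const char *arr, size_t size)
{
    return memchr(arr, '\0', size) != NULL;
}

/* Hypothetical helper: number of bytes before the first '\0' within the
   first size bytes of arr, or size if there is no '\0' in that range. */
size_t bounded_length(const char *arr, size_t size)
{
    const char *nul = memchr(arr, '\0', size);
    return nul ? (size_t)(nul - arr) : size;
}
```

Called as bounded_length(str3, sizeof str3), this yields 5 instead of depending on whatever happens to follow the array in memory.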

Answer 3:

Why can a null character be embedded in a conversion specifier for scanf?

A null character cannot directly be specified as part of a scanset as in "%31[^\0]" as the parsing of the string ends with the first null character.

"%31[^\0]" is parsed by scanf() as if it was "%31[^". As that is an invalid scanf() specifier, the behavior is undefined (UB). A compiler may provide diagnostics on more than what scanf() sees.
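If the goal is to read input up to a null character, one option that avoids scansets entirely is a plain getc loop. This is a sketch under that assumption about the goal, not something taken from the question; read_until_nul is a hypothetical name.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: copies bytes from fp into buf until a '\0' byte,
   end of file, or cap - 1 bytes have been stored, then null-terminates
   buf. A '\0' read from the stream is consumed but not counted.
   Returns the number of bytes stored before the terminator. */
size_t read_until_nul(FILE *fp, char *buf, size_t cap)
{
    size_t n = 0;
    int c;

    if (cap == 0)
        return 0;               /* no room even for the terminator */

    while (n + 1 < cap && (c = getc(fp)) != EOF && c != '\0')
        buf[n++] = (char)c;

    buf[n] = '\0';
    return n;
}
```

With the input from the question (the bytes foo, newline, null, bar), a call such as read_until_nul(stdin, buf, sizeof buf) would store "foo\n" and return 4.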

A null character can be among the characters matched by a scanset, as in "%31[^\n]". This will read in all characters other than '\n', including null characters.

In the unusual case of reading null characters, use "%n" to determine the number of characters scanned.

int n = 0;
scanf("%31[^\n]%n", buf, &n);
scanf("%*1[\n]"); // Consume any 1 trailing \n
if (n) {
  printf("First part of buf=%s, %d characters read ", buf, n);
}
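The same idea can be wrapped in a function that works on any stream, which makes it easier to try with input that really contains null bytes. This is a minimal sketch; read_field_counted is a hypothetical name, and the fixed 32-byte buffer simply mirrors the snippet above.

```c
#include <stdio.h>

/* Hypothetical helper: reads up to 31 characters other than '\n' from fp
   into buf (which must hold at least 32 bytes) and stores the number of
   characters consumed in *count. Null bytes in the input are stored like
   any other character, so *count can exceed strlen(buf).
   Returns 1 if the scanset matched at least one character, 0 otherwise. */
int read_field_counted(FILE *fp, char buf[32], int *count)
{
    *count = 0;
    buf[0] = '\0';

    /* %n records how many characters have been consumed so far; it is
       only reached if the scanset conversion succeeded. */
    int rc = fscanf(fp, "%31[^\n]%n", buf, count);

    /* Consume at most one trailing newline, if there is one. */
    if (fscanf(fp, "%*1[\n]") == EOF) {
        /* end of input: nothing left to consume */
    }

    return rc == 1;
}
```

If the stream holds the bytes a, b, null, c, d, newline, the call stores all five bytes in buf and sets *count to 5, while strlen(buf) reports only 2.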

Source: https://stackoverflow.com/questions/66013007/why-can-a-null-character-be-embedded-in-a-conversion-specifier-for-scanf

Tags

scanf