Is it safe to call the functions from <cctype> with char arguments?

Submitted by 狂风中的少年 on 2020-03-17 11:55:32

Question


The C programming language says that the functions from <ctype.h> follow a common requirement:

ISO C99, 7.4p1:

In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.

This means that the following code is unsafe:

int upper(const char *s, size_t index) {
  return toupper(s[index]);
}

If this code is executed on an implementation where char has the same value space as signed char and there is a character with a negative value in the string, this code invokes undefined behavior. The correct version is:

int upper(const char *s, size_t index) {
  return toupper((unsigned char) s[index]);
}

Nevertheless, I see many C++ examples that ignore this possibility of undefined behavior. So is there anything in the C++ standard that guarantees that the first version above will not lead to undefined behavior, or are all the examples wrong?
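For C++ code, the cast from the corrected version can be wrapped once and reused. The following is only a minimal sketch: safe_toupper and to_upper_copy are hypothetical names, not standard library functions, and the sketch assumes the text is processed one byte at a time in the current C locale.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of the standard library): converts to
// unsigned char first, so the int handed to std::toupper is always in
// the range 0..UCHAR_MAX, as 7.4p1 requires.
inline char safe_toupper(char c) {
    return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
}

// Hypothetical helper: returns an uppercased copy, byte by byte.
inline std::string to_upper_copy(std::string s) {
    for (char& c : s) {
        c = safe_toupper(c);
    }
    return s;
}
```

Calling to_upper_copy on a string that contains bytes above 0x7F stays within the documented argument range whether plain char is signed or unsigned.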

[Additional Keywords: ctype cctype isalnum isalpha isblank iscntrl isdigit isgraph islower isprint ispunct isspace isupper isxdigit tolower]


Answer 1:


For what it's worth, the Solaris Studio compilers (using stlport4) are one compiler suite that produces an unexpected result here. Compiling and running this:

#include <stdio.h>
#include <cctype>

int main() {
    char ch = '\xa1'; // byte 0xA1: '¡' in Latin-1; a continuation byte in UTF-8
    printf("is whitespace: %i\n", std::isspace(ch));
    return 0;
}

gives me:

kevin@solaris:~/scratch
$ CC -library=stlport4 whitespace.cpp && ./a.out 
is whitespace: 8

For reference:

$ CC -V
CC: Studio 12.5 Sun C++ 5.14 SunOS_i386 2016/05/31

Of course, the standard permits this result, since the behavior is undefined, but it's definitely surprising.


EDIT: Since it was pointed out that the above version contained undefined behavior in the attempt to assign char ch = '\xa1' due to integer overflow, here's a version that avoids that and still retains the same output:

#include <stdio.h>
#include <cctype>

int main() {
    char ch = -95;
    printf("is whitespace: %i\n", std::isspace(ch));
    return 0;
}

And that does still print 8 on my Solaris VM:

kevin@solaris:~/scratch
$ CC -library=stlport4 whitespace.cpp && ./a.out 
is whitespace: 8

EDIT 2: And here's a program that might otherwise look sane but gives an unexpected result due to UB in the use of std::isspace():

#include <cstdio>
#include <cstring>
#include <cctype>

static int count_whitespace(const char* str, int n) {
    int count = 0;
    for (int i = 0; i < n; i++)
        if (std::isspace(str[i]))  // oops!
            count += 1;
    return count;
}

int main() {
    const char* batman = "I am batman\xa1";
    int n = std::strlen(batman);
    std::printf("%i\n", count_whitespace(batman, n));
    return 0;
}

And, on my Solaris machine:

kevin@solaris:~/scratch
$ CC whitespace.cpp && ./a.out
3

Note that depending on how you permute this program, you'll probably get the expected result of two whitespace characters; that is, there is almost certainly some compiler optimization kicking in that takes advantage of this UB to give you the wrong result faster.

You could imagine this biting you in the face if you were, for example, attempting to tokenize a UTF-8 string by searching for (non-multibyte) whitespace characters in the string. Such a program would behave correctly when casting str[i] to unsigned char.
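As a rough illustration of that tokenizing scenario, here is a sketch of a splitter that casts each byte before classifying it. The name split_on_whitespace is hypothetical, and the sketch assumes the default "C" locale, in which std::isspace does not classify any byte above 0x7F as whitespace.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: splits text on whitespace as classified by
// std::isspace. Every byte is converted to unsigned char first, so bytes
// of multibyte UTF-8 sequences (0x80..0xFF) are passed as valid arguments.
std::vector<std::string> split_on_whitespace(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (char c : text) {
        if (std::isspace(static_cast<unsigned char>(c))) {
            // A whitespace byte ends the current token, if there is one.
            if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
        } else {
            current += c;
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}
```

With the cast in place, the trailing 0xA1 byte in "I am batman\xa1" stays attached to the last token instead of being handed to std::isspace as a negative value.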




Answer 2:


Sometimes most people are wrong. I think that's so here. Having said that, there's nothing to stop a standard library implementor from defining the behaviour that most people expect. So maybe that's why most people don't care: they've never actually seen a bug resulting from this error.




Answer 3:


The history behind the char type is that it was originally the type used to describe 7-bit ASCII characters. At the same time, C lacked a separate 8-bit integer type. So in the pre-standard days of the eighties, some compilers made char unsigned, since it doesn't make sense to have negative indices in a symbol table, while other compilers made char signed, to make it consistent with all the other integer types.

When the time came to standardize C, both versions existed. Unfortunately, the committee decided to let it remain that way, leaving the decision to the compiler. Instead they added two other types: signed char and unsigned char. signed char is part of the signed integer types, unsigned char is part of the unsigned integer types, and char is part of neither, though it must have the same representation as either signed char or unsigned char. (This is all described in C11 6.2.5)
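The paragraph above cites C11, but the same three-way split exists in C++, where it can be checked at compile time. This is only a sketch; plain_char_is_signed is a hypothetical name introduced for illustration.

```cpp
#include <climits>
#include <type_traits>

// In C++, char is a distinct type from both signed char and unsigned char,
// even though it shares its representation with one of them.
static_assert(!std::is_same<char, signed char>::value,
              "char and signed char are different types");
static_assert(!std::is_same<char, unsigned char>::value,
              "char and unsigned char are different types");
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");

// Whether plain char is signed is implementation-defined; this constant
// (a hypothetical name) records what the current compiler chose.
constexpr bool plain_char_is_signed = std::is_signed<char>::value;
static_assert(plain_char_is_signed == (CHAR_MIN < 0),
              "CHAR_MIN is negative exactly when plain char is signed");
```

On typical x86 toolchains plain_char_is_signed is true, which is the configuration in which the negative-argument problem discussed in the question can arise.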

Notably, char was never anything but 8 bits on all known implementations, save for some exotic oddball DSPs that worked with 16-bit bytes. When "extended" symbol tables were used, either the implementation changed from 7-bit to 8-bit characters, or wchar_t was used. Please note that wchar_t has been in the C language since the beginning, so assuming that char was at some point used for things like UTF-8 is probably incorrect (though theoretically possible).

Now if char is signed, and you store a value larger than CHAR_MAX or smaller than CHAR_MIN inside it, you invoke undefined behavior, as per C11 6.5 §5. Period. So if you have an array of char and any item inside it violates the type boundaries, you have undefined behavior there already. Even though character types have no trap representations, undefined behavior could cause the code to misbehave in other ways, such as incorrect optimizations.

The ctype.h functions allow EOF as parameter, but should otherwise behave as if working with character types, even though the parameter is int to allow EOF. The text from 7.4 §1 is mostly saying that "if you pass some random int to this function, which is neither of the same representation as a char, nor EOF, the behavior is undefined".

But if you pass a char where you have already invoked signed integer overflow/underflow, you already have undefined behavior even before calling the function - this has nothing to do with the ctype.h functions or any other function. Thus your assumption that the posted "upper" function is unsafe is incorrect - this code is no different from any other code using the char type.

An example of undefined behavior caused by the cited ctype.h restrictions in 7.4 would rather be something like toupper(666).



Source: https://stackoverflow.com/questions/7131026/is-it-safe-to-call-the-functions-from-cctype-with-char-arguments
