Does wide character input/output in C always read from / write to the correct (system default) encoding?

梦毁少年i 2021-01-02 09:11

I'm primarily interested in Unix-like systems (e.g., portable POSIX), as it seems like Windows does strange things with wide characters.

Do the read and write wide character functions always use the correct (system default) encoding?

3 Answers
  • 2021-01-02 09:51

    So long as the locale is set correctly, there shouldn't be any issues processing UTF-8 files on a system using UTF-8, using the wide character functions. They'll be able to interpret things correctly, i.e. they'll treat a character as 1-4 bytes as necessary (in both input and output). You can test it out with something like this:

    #include <stdio.h>
    #include <locale.h>
    #include <wchar.h>
    
    int main()
    {
        setlocale(LC_CTYPE, "en_GB.UTF-8");
        // setlocale(LC_CTYPE, ""); // to use environment variable instead
        const wchar_t *txt = L"£Δᗩ";
    
        wprintf(L"The string %ls has %zu characters\n", txt, wcslen(txt));
    }
    
    $ gcc -o loc loc.c && ./loc
    The string £Δᗩ has 3 characters
    

    If you use the standard byte-oriented functions (in particular the character and string functions) on multibyte strings carelessly, things will start to break, e.g. the equivalent:

    const char *txt = "£Δᗩ";   /* same program as above, but byte-oriented */
    printf("The string %s has %zu characters\n", txt, strlen(txt));   /* strlen needs <string.h> */
    
    $ gcc -o nloc nloc.c && ./nloc
    The string £Δᗩ has 7 characters
    

    The string still prints correctly here because it's essentially just a stream of bytes, and as the system is expecting UTF-8 sequences, they're translated perfectly. Of course strlen is reporting the number of bytes in the string, 7 (plus the \0), with no understanding that a character and a byte aren't equivalent.

    In this respect, because of the compatibility between ASCII and UTF-8, you can often get away with treating UTF-8 files as simply multibyte C strings, as long as you're careful.
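    To make that concrete (my own sketch, not part of the original example, and it relies on the POSIX-specified behaviour of mbstowcs() with a null destination): you can count bytes with strlen() and characters with mbstowcs() on the same string, as long as the locale is set first:

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    int main(void)
    {
        setlocale(LC_CTYPE, "");   /* pick up the environment's (UTF-8) locale */
        const char *txt = "£Δᗩ";
    
        size_t bytes = strlen(txt);              /* counts bytes */
        size_t chars = mbstowcs(NULL, txt, 0);   /* POSIX: counts wide characters needed */
    
        printf("%zu bytes, %zu characters\n", bytes, chars);
        return 0;
    }
    
    On a UTF-8 locale this should print 7 bytes, 3 characters.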

    There's a degree of flexibility as well. It's possible to convert a standard C string (as a multibyte string) to a wide character string easily:

    const char *stdtxt = "ASCII and UTF-8 €£¢";
    wchar_t buf[100];
    mbstowcs(buf, stdtxt, sizeof buf / sizeof buf[0]);   /* needs <stdlib.h>; locale must be set first */
    
    wprintf(L"%ls has %zu wide characters\n", buf, wcslen(buf));
    
    Output:
    ASCII and UTF-8 €£¢ has 19 wide characters
    

    Once you've used a wide character function on a stream, its orientation is set to wide. If you later want to use the standard byte I/O functions on it, you'll need to re-open the stream first (e.g. with freopen), which is probably why the usual recommendation is not to do this on stdout. However, if you only ever use wide character functions on stdin and stdout (including in any code that you link to), you will not have any problems.
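    As a rough sketch of how that looks in practice (my own illustration, and it assumes an interactive program that can reopen /dev/tty): fwide() queries or sets a stream's orientation, and freopen() is the portable way to clear it once it has been set:

    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>
    
    int main(void)
    {
        setlocale(LC_CTYPE, "");
    
        wprintf(L"wide output\n");      /* first wide call: stdout is now wide-oriented */
    
        if (fwide(stdout, 0) > 0) {     /* query only: >0 wide, <0 byte, 0 unset */
            /* freopen is the only portable way to clear the orientation;
               "/dev/tty" is just an assumption for an interactive program */
            if (freopen("/dev/tty", "w", stdout) == NULL)
                return 1;
        }
    
        printf("byte output\n");        /* safe again on the re-opened stream */
        return 0;
    }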

  • 2021-01-02 09:59

    Don't use fputs with anything other than ASCII.

    If you want to write, say, UTF-8, then use a function that returns the actual size in bytes of the UTF-8 string and pass that count to fwrite, so the right number of bytes is written without worrying about any stray '\0' bytes inside the data.
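    A minimal sketch of the idea, assuming the UTF-8 data is an ordinary NUL-terminated C string so strlen() gives its size in bytes (if the data can contain embedded NULs, the byte count has to come from somewhere else):

    #include <stdio.h>
    #include <string.h>
    
    int main(void)
    {
        const char *utf8 = "£Δᗩ\n";      /* UTF-8 bytes, here NUL-terminated */
        size_t nbytes = strlen(utf8);     /* byte length, not character count */
    
        /* fwrite writes exactly nbytes bytes and never interprets them */
        if (fwrite(utf8, 1, nbytes, stdout) != nbytes)
            fprintf(stderr, "short write\n");
        return 0;
    }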

  • 2021-01-02 10:15

    The relevant text governing the behavior of the wide character stdio functions and their relationship to locale is from POSIX XSH 2.5.2 Stream Orientation and Encoding Rules:

    http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_05_02

    Basically, the wide character stdio functions always write in the encoding that's in effect (per the LC_CTYPE locale category) at the time the FILE stream becomes wide-oriented; this means the first time a wide stdio function is called on it, or fwide is used to set the orientation to wide. So as long as a proper LC_CTYPE locale is in effect matching the desired "system" encoding (e.g. UTF-8) when you start working with the stream, everything should be fine.
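    To illustrate that ordering (my own sketch, not taken from the POSIX text): set LC_CTYPE before the first wide operation on the stream, and optionally pin the orientation explicitly with fwide():

    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>
    
    int main(void)
    {
        /* The encoding is latched when the stream becomes wide-oriented,
           so LC_CTYPE must already be correct at that point. */
        setlocale(LC_CTYPE, "");
    
        fwide(stdout, 1);               /* optional: fix the orientation explicitly now */
        wprintf(L"π ≈ %.5f\n", 3.14159);
        return 0;
    }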

    However, one important consideration you should not overlook is that you must not mix byte and wide oriented operations on the same FILE stream. Failure to observe this rule is not a reportable error; it simply results in undefined behavior. As a good deal of library code assumes stderr is byte oriented (and some even makes the same assumption about stdout), I would strongly discourage ever using wide-oriented functions on the standard streams. If you do, you need to be very careful about which library functions you use.

    Really, I can't think of any reason at all to use wide-oriented functions. fprintf is perfectly capable of sending wide-character strings to byte-oriented FILE streams using the %ls specifier.
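    For example (a small sketch of the alternative being described): %ls converts the wide argument to multibyte in the current locale, so stdout never has to become wide-oriented:

    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>
    
    int main(void)
    {
        setlocale(LC_CTYPE, "");        /* %ls converts using the current locale */
    
        const wchar_t *txt = L"£Δᗩ";
        /* stdout stays byte-oriented; the wide string is converted on output */
        printf("The string %ls has %zu characters\n", txt, wcslen(txt));
        return 0;
    }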
