C: reading non-ASCII characters

星月不相逢 2021-01-13 09:58

I am parsing a file that contains characters such as æ, ø and å. Assume I have stored a line of the text file as follows:

3 answers
  • 2021-01-13 10:08

    You need to know which encoding your characters use. It is very probably UTF-8 (and you should use UTF-8 everywhere; read Joel's blog post on Unicode). If the encoding is not UTF-8, convert the text to UTF-8, e.g. using libiconv.
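
    As a rough sketch of that conversion step, here is how it could look with the POSIX iconv interface (on Linux it ships with glibc; elsewhere you may need to link libiconv). The function name latin1_to_utf8 and the choice of ISO-8859-1 as the source encoding are assumptions for illustration; use whatever encoding your file really has.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert a NUL-terminated ISO-8859-1 string to UTF-8 with iconv.
   `out` receives the NUL-terminated result; `outsize` is its capacity.
   Returns 0 on success, -1 on failure.
   Assumption: the input really is ISO-8859-1 (Latin-1). */
int latin1_to_utf8(const char *in, char *out, size_t outsize)
{
    if (outsize == 0)
        return -1;

    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1"); /* (to, from) */
    if (cd == (iconv_t)-1)
        return -1;

    char *inptr = (char *)in;     /* iconv wants char **; the input is only read */
    size_t inleft = strlen(in);
    char *outptr = out;
    size_t outleft = outsize - 1; /* keep one byte for the terminator */

    size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;                /* e.g. the output buffer was too small */

    *outptr = '\0';
    return 0;
}
```

    Each Latin-1 byte above 0x7F becomes two bytes in UTF-8, so an output buffer of twice the input length plus one is always enough for this particular pair of encodings.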

    Then you need a C library for UTF-8. There are many of them (but none is standardized in the C11 language yet). I recommend libunistring or glib (from GTK), but see also this.

    Your code will change, since a UTF-8 character takes one to four 8-bit bytes (the original UTF-8 design allowed up to six, which is what the Wikipedia UTF-8 page mentions, but RFC 3629 restricts it to four; see the Unicode standard for details). You won't test whether a single byte (i.e. a plain C char) is a letter, but whether a byte and the few bytes after it (given by a pointer, i.e. a char* or better a uint8_t*) encode a letter (including Cyrillic letters, etc.).

    Not every sequence of bytes is a valid UTF-8 representation, and you might want to validate a line (or a null-terminated C string) before analyzing it.
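
    To make the "a byte and the few bytes after it" idea concrete, here is a minimal hand-written decoder that also does the validation. It is only a sketch, not what libunistring or glib do internally, and the names utf8_decode and is_danish_extra_letter are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from the first `n` bytes of `s`.
   Returns the number of bytes consumed (1..4), or 0 if the bytes are malformed. */
size_t utf8_decode(const uint8_t *s, size_t n, uint32_t *cp)
{
    if (n == 0)
        return 0;

    size_t len;
    uint32_t c, min; /* min = smallest code point allowed for this length */

    if (s[0] < 0x80)                { len = 1; c = s[0];        min = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;                  /* stray continuation byte or invalid lead byte */

    if (len > n)
        return 0;                   /* sequence is cut off */

    for (size_t i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80)  /* continuation bytes look like 10xxxxxx */
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }

    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;                   /* overlong form, out of range, or surrogate */

    *cp = c;
    return len;
}

/* Hypothetical helper: true for the letters from the question, in either case. */
int is_danish_extra_letter(uint32_t cp)
{
    return cp == 0xE6 || cp == 0xF8 || cp == 0xE5 ||  /* æ ø å */
           cp == 0xC6 || cp == 0xD8 || cp == 0xC5;    /* Æ Ø Å */
}
```

    You would call utf8_decode in a loop, advancing the pointer by the returned length each time. For a general "is this a letter" test over all scripts, you still want one of the libraries above rather than a hard-coded list like this.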

  • 2021-01-13 10:08

    Let's say you use UTF-8.

    You need to understand how UTF-8 works.

    Here's a small function that should do what you want:

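    /* Count the characters (code points) in a NUL-terminated UTF-8 string.
       Returns -1 if str is NULL or the byte structure is not valid UTF-8. */
    int nbChars(const char *str) {
        int len = 0;
        int i = 0;
        int charSize = 0; // Bytes still expected for the current char

        if (!str)
            return -1;
        while (str[i])
        {
            // Work on an unsigned byte: plain char may be signed
            unsigned char c = (unsigned char)str[i];

            if (charSize == 0)
            {
                ++len;
                if ((c & 0x80) == 0x00)      // 0xxxxxxx: ascii char
                    charSize = 1;
                else if ((c & 0xE0) == 0xC0) // 110xxxxx: 2-byte char
                    charSize = 2;
                else if ((c & 0xF0) == 0xE0) // 1110xxxx: 3-byte char
                    charSize = 3;
                else if ((c & 0xF8) == 0xF0) // 11110xxx: 4-byte char
                    charSize = 4;
                else
                    return -1; // stray continuation byte or invalid lead byte
            }
            else if ((c & 0xC0) != 0x80) // continuation bytes must be 10xxxxxx
                return -1;
            --charSize;
            ++i;
        }
        return charSize == 0 ? len : -1; // a char cut off at the end is invalid
    }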
    

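    It returns the number of characters (code points), or -1 if the string is not valid UTF-8.

    (By "not valid" I mean the byte format is wrong. I don't check whether each character actually exists, so overlong encodings and unassigned code points are not rejected.)

    EDIT: As stated in the comment section, this code doesn't handle decomposed Unicode: a letter stored as a base letter plus a combining mark is counted as two characters.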
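
    To illustrate the decomposed case, here is a small self-contained sketch. It uses a simpler counter than the function above (the name count_code_points is made up here) that assumes the input is already valid UTF-8 and just counts the bytes that start a character.

```c
#include <stddef.h>

/* Count code points in a string that is already known to be valid UTF-8:
   every byte that is not a continuation byte (10xxxxxx) starts one. */
size_t count_code_points(const char *s)
{
    size_t n = 0;
    for (; *s; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    return n;
}

/* The same visible letter "å", stored two ways. */
static const char precomposed[] = "\xC3\xA5";  /* U+00E5 */
static const char decomposed[]  = "a\xCC\x8A"; /* U+0061 'a' + U+030A combining ring above */
```

    count_code_points(precomposed) gives 1 and count_code_points(decomposed) gives 2, although both render as a single letter. If that difference matters to you, normalize the text (e.g. to NFC) with a Unicode library before counting.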

  • 2021-01-13 10:21

    The C standard IO library can only read bytes. Your file probably contains multibyte characters, encoded as UTF-8 or in some other encoding. You'll need a library for interpreting such files.

    It is possible that your file contains Latin-1 text, in which case every character is a single byte. Even then, isgraph will not recognize æ, ø or å unless you have set a matching locale.
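
    For the Latin-1 case, a minimal sketch of the locale setup could look like this. The locale name "en_US.ISO-8859-1" is an assumption; the names that are actually installed vary from system to system (check `locale -a`), and the helper names are made up for this example.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Count the bytes that isgraph() accepts under the current LC_CTYPE locale. */
size_t count_graph_bytes(const char *s)
{
    size_t n = 0;
    for (; *s; ++s)
        if (isgraph((unsigned char)*s)) /* cast: isgraph must not get a negative char */
            ++n;
    return n;
}

/* Try to select a Latin-1 locale for character classification.
   Assumption: a locale with this name is installed. Returns 1 if it was accepted. */
int use_latin1_locale(void)
{
    return setlocale(LC_CTYPE, "en_US.ISO-8859-1") != NULL;
}
```

    In the default "C" locale only ASCII is classified, so the bytes 0xE6, 0xF8 and 0xE5 are not counted. After a successful use_latin1_locale() they should be. None of this helps for UTF-8 input, where isgraph would be looking at fragments of characters.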

    Bottom line: find the encoding used in your file. Then read it accordingly. In any case, plain C does not know about encodings.
