UTF-8 Character Count

时光说笑 2021-01-23 10:17

I'm programming something that counts the number of UTF-8 characters in a file. I've already written the base code, but now I'm stuck in the part where the characters are su

4 Answers
  • 2021-01-23 10:32

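    In C, as in C++, there is no ready-made standard function for counting UTF-8 characters. You can convert the UTF-8 text to a wide string with mbstowcs and call wcslen on the result, but that costs an extra conversion and an extra buffer, which is wasteful if the count is all you need. Note also that wchar_t is 16 bits wide on Windows, so there the wide string is UTF-16 and characters outside the Basic Multilingual Plane count twice.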

    I think a good answer to your question is here: counting unicode characters in c++.

    Example from the linked answer:

    // count every byte that is not a continuation byte (10xxxxxx)
    for (; *p != 0; ++p)
        count += ((*p & 0xc0) != 0x80);
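
    The loop above is only a fragment. A minimal sketch of it as a complete function could look like the following; the name utf8_count is hypothetical, and the code assumes the input is valid, NUL-terminated UTF-8. The pointer is cast to unsigned char so the bit test does not depend on whether plain char is signed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts UTF-8 code points in a NUL-terminated
   string by counting every byte that is NOT a continuation byte
   (10xxxxxx). Assumes the input is valid UTF-8. */
size_t utf8_count(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; ++p)
        count += ((*p & 0xC0) != 0x80);
    return count;
}
```

    For instance, utf8_count("h\xC3\xA9llo") treats the two bytes of "é" as one character and returns 5.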
    
  • 2021-01-23 10:38

    There are several options:

    • rely on your system's implementation of wide and multibyte encodings
      • read the file as a wide stream and count the wide characters, letting the system do the UTF-8 multibyte-to-wide conversion on its own (see main1 below)
      • read the file as bytes, convert the multibyte string into a wide string, and count the wide characters (see main2 below)
    • use an external library that operates on UTF-8 strings and counts the Unicode characters (see main3 below, which uses libunistring)
    • or roll your own utf8_strlen-style function that relies on the structure of UTF-8 and checks the bytes itself, as shown in the other answers.

    Here is an example program with rudimentary error checking via assert. On Linux it has to be linked with -lunistring:

    #include <stdio.h>
    #include <wchar.h>
    #include <locale.h>
    #include <assert.h>
    #include <stdlib.h>
    
    void main1()
    {
        // read the file as wide characters
        const char *l = setlocale(LC_ALL, "en_US.UTF-8");
        assert(l);
        FILE *file = fopen("file.txt", "r");
        assert(file);
        int count = 0;
        while(fgetwc(file) != WEOF) {
            count++;
        }
        fclose(file);
        printf("Number of characters: %i\n", count);
    }
    
    // helper: reads the whole file into a NUL-terminated heap buffer
    // and stores its length (in bytes) in *strlen
    char *file_to_buf(const char *filename, size_t *strlen) {
        FILE *file = fopen(filename, "r");
        assert(file);
        size_t n = 0;
        char *ret = malloc(1);
        assert(ret);
        for (int c; (c = fgetc(file)) != EOF;) {
            ret = realloc(ret, n + 2);
            assert(ret);
            ret[n++] = c;
        }
        ret[n] = '\0';
        *strlen = n;
        fclose(file);
        return ret;
    }
    
    void main2() {
        const char *l = setlocale(LC_ALL, "en_US.UTF-8");
        assert(l);
        size_t strlen = 0;
        char *str = file_to_buf("file.txt", &strlen);
        assert(str);
        // convert the multibyte string to a wide string,
        // assuming the multibyte encoding is UTF-8;
        // this could also be done in a streaming fashion by reading byte by byte from the file,
        // calling `mbtowc`, checking errno for EILSEQ and managing a small buffer
        mbstate_t ps = {0};
        const char *tmp = str;
        size_t count = mbsrtowcs(NULL, &tmp, 0, &ps);
        assert(count != (size_t)-1);
        printf("Number of characters: %zu\n", count);
        free(str);
    }
    
    #include <unistr.h> // u8_mbsnlen from libunistring
    
    void main3() {
        size_t strlen = 0;
        char *str = file_to_buf("file.txt", &strlen);
        assert(str);
        // for simplicity I am assuming uint8_t is the same type as unsigned char
        size_t count = u8_mbsnlen((const uint8_t *)str, strlen);
        printf("Number of characters: %zu\n", count);
        free(str);
    }
    
    int main() {
        main1();
        main2();
        main3();
    }
    
  • 2021-01-23 10:40

    You could look into the specs: https://tools.ietf.org/html/rfc3629.

    Section 3 contains this table:

       Char. number range  |        UTF-8 octet sequence
          (hexadecimal)    |              (binary)
       --------------------+---------------------------------------------
       0000 0000-0000 007F | 0xxxxxxx
       0000 0080-0000 07FF | 110xxxxx 10xxxxxx
       0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
       0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    

    You can inspect the bytes and assemble the Unicode code points yourself: the high bits of the first byte tell you how many bytes the sequence has.
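
    As a rough illustration of reading that table in code, the sketch below derives the sequence length from the leading byte and assembles the code point from the payload bits. The names utf8_seq_len and utf8_decode are hypothetical. The decoder checks that continuation bytes have the 10xxxxxx form, but it does not reject overlong encodings or surrogate code points, which a strict RFC 3629 decoder would have to do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sequence length implied by a leading byte,
   following the RFC 3629 table; 0 if the byte cannot start a sequence. */
int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;  /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* continuation or invalid byte */
}

/* Hypothetical helper: decodes one code point from buf[0..len).
   Returns the number of bytes consumed, or 0 if the sequence is
   truncated or a continuation byte is malformed.
   Does NOT reject overlong forms or surrogates. */
size_t utf8_decode(const unsigned char *buf, size_t len, uint32_t *cp)
{
    static const unsigned char lead_mask[5] = {0, 0x7F, 0x1F, 0x0F, 0x07};

    assert(buf != NULL && cp != NULL);
    if (len == 0)
        return 0;
    int n = utf8_seq_len(buf[0]);
    if (n == 0 || (size_t)n > len)
        return 0;

    uint32_t value = buf[0] & lead_mask[n];   /* payload bits of the lead byte */
    for (int i = 1; i < n; ++i) {
        if ((buf[i] & 0xC0) != 0x80)          /* must be 10xxxxxx */
            return 0;
        value = (value << 6) | (buf[i] & 0x3F);
    }
    *cp = value;
    return (size_t)n;
}
```

    Calling utf8_decode repeatedly and advancing by its return value walks a buffer one code point at a time; counting the calls gives the character count, and a return value of 0 marks the spot where the input stops being well-formed.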

    A separate question is whether a base character followed by an accent (a combining mark, cf. https://en.wikipedia.org/wiki/Combining_character) should count as one character or as several.
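
    To make the difference concrete, here is a small sketch that reports both numbers for a string. The names is_combining_mark and utf8_count_both are hypothetical, the input is assumed to be valid UTF-8, and only the Combining Diacritical Marks block (U+0300 to U+036F) is recognised. Proper grapheme counting needs the full Unicode segmentation rules, which this does not attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true only for the Combining Diacritical Marks
   block, U+0300..U+036F. Other combining ranges are NOT covered. */
static int is_combining_mark(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Hypothetical helper: counts all code points and, separately, the
   code points that are not combining marks. Assumes valid UTF-8. */
void utf8_count_both(const char *s, size_t *code_points, size_t *base_chars)
{
    assert(s != NULL && code_points != NULL && base_chars != NULL);
    const unsigned char *p = (const unsigned char *)s;
    *code_points = 0;
    *base_chars = 0;

    while (*p != 0) {
        uint32_t cp;
        int extra;                       /* continuation bytes expected */
        if (*p < 0x80)      { cp = *p;        extra = 0; }
        else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; }
        else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; }
        else                { cp = *p & 0x07; extra = 3; }
        ++p;
        /* stops early at the terminating NUL, so it never runs past the end */
        while (extra-- > 0 && (*p & 0xC0) == 0x80)
            cp = (cp << 6) | (*p++ & 0x3F);

        ++*code_points;
        if (!is_combining_mark(cp))
            ++*base_chars;
    }
}
```

    For "e\xCC\x81" (the letter e followed by U+0301, combining acute accent) this reports two code points but only one base character, even though the text displays as a single "é".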

  • 2021-01-23 10:47

    See: https://en.wikipedia.org/wiki/UTF-8#Encoding

    Each UTF-8 sequence consists of one leading byte and zero or more continuation bytes. Continuation bytes always start with the bits 10, and a leading byte never does. You can use this to count only the first byte of each UTF-8 sequence.

        if((b&0xC0) != 0x80) {
            count++;
        }
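
    Since the question is about a file, the same test can be applied while reading the stream byte by byte. This is only a sketch: the function name utf8_count_stream is hypothetical, the caller is assumed to have opened the file already, and the content is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: counts UTF-8 code points in an open stream by
   counting the bytes that are not continuation bytes (10xxxxxx).
   Assumes valid UTF-8; reads until EOF. */
long utf8_count_stream(FILE *fp)
{
    assert(fp != NULL);
    long count = 0;
    int b;                               /* int, so EOF is distinguishable */
    while ((b = fgetc(fp)) != EOF) {
        if ((b & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

    A caller would open the file with fopen(path, "rb"), pass the handle to utf8_count_stream, and close it afterwards. A byte order mark at the start of the file, if present, is counted as one character by this approach.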
    

    Keep in mind that this gives wrong counts if the file contains invalid UTF-8 sequences. Also, "UTF-8 characters" might mean different things. For example "
