UTF-8 Character Count

时光说笑 2021-01-23 10:17

I'm programming something that counts the number of UTF-8 characters in a file. I've already written the base code but now I'm stuck in the part where the characters are su

4 answers
  •  礼貌的吻别
    2021-01-23 10:38

    There are several options you may take:

    • You may depend on your system's implementation of wide and multibyte encodings:
      • you may read the file as a wide stream and just count the wide characters, depending on the system to do the UTF-8 multibyte-to-wide conversion on its own (see main1 below);
      • you may read the file as bytes, convert the multibyte string into a wide string, and count the wide characters (see main2 below).
    • You may use an external library that operates on UTF-8 strings and counts the Unicode characters (see main3 below, which uses libunistring).
    • Or roll your own utf8_strlen-ish solution that relies on a specific property of the UTF-8 encoding and checks the bytes yourself, as shown in other answers.
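
The last option can be sketched as follows. This is a minimal sketch, not code taken from any of the other answers: the names utf8_count and utf8_count_stream are made up for illustration, and the input is assumed to be valid UTF-8. It relies on the property that continuation bytes always have the bit pattern 10xxxxxx, so counting every byte that is not a continuation byte counts the code points.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: count code points in a buffer that is assumed to
   hold valid UTF-8. Each code point has exactly one lead byte, and every
   continuation byte matches 10xxxxxx, so continuation bytes are skipped. */
size_t utf8_count(const unsigned char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if ((buf[i] & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}

/* Hypothetical helper: the same idea applied to a stream in fixed-size
   chunks. A chunk boundary may split a multibyte sequence, but that does
   not matter because each byte is classified on its own. */
size_t utf8_count_stream(FILE *fp)
{
    unsigned char chunk[4096];
    size_t total = 0;
    size_t got;
    while ((got = fread(chunk, 1, sizeof chunk, fp)) > 0) {
        total += utf8_count(chunk, got);
    }
    return total;
}
```

Calling utf8_count_stream(fp) on a file opened with fopen(name, "rb") gives the count without loading the whole file into memory. Note that this counts code points, so a letter followed by a combining accent counts as two, and malformed input is not rejected.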

    Here is an example program with main1, main2 and main3. It has to be compiled with -lunistring under Linux and does only rudimentary error checking with assert:

    #include <assert.h>
    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <wchar.h>
    
    void main1()
    {
        // read the file as wide characters
        const char *l = setlocale(LC_ALL, "en_US.UTF-8");
        assert(l);
        FILE *file = fopen("file.txt", "r");
        assert(file);
        int count = 0;
        while(fgetwc(file) != WEOF) {
            count++;
        }
        fclose(file);
        printf("Number of characters: %i\n", count);
    }
    
    // just a helper function cause i'm lazy
    char *file_to_buf(const char *filename, size_t *strlen) {
        FILE *file = fopen(filename, "r");
        assert(file);
        size_t n = 0;
        char *ret = malloc(1);
        assert(ret);
        for (int c; (c = fgetc(file)) != EOF;) {
            ret = realloc(ret, n + 2);
            assert(ret);
            ret[n++] = c;
        }
        ret[n] = '\0';
        *strlen = n;
        fclose(file);
        return ret;
    }
    
    void main2() {
        const char *l = setlocale(LC_ALL, "en_US.UTF-8");
        assert(l);
        size_t strlen = 0;
        char *str = file_to_buf("file.txt", &strlen);
        assert(str);
        // convert the multibyte string to a wide string,
        // assuming the multibyte encoding is UTF-8
        // this may also be done in a streaming fashion when reading byte by byte from a file
        // and calling `mbtowc`, checking errno for EILSEQ and managing some buffer
        mbstate_t ps = {0};
        const char *tmp = str;
        size_t count = mbsrtowcs(NULL, &tmp, 0, &ps);
        assert(count != (size_t)-1);
        printf("Number of characters: %zu\n", count);
        free(str);
    }
    
    #include <unistr.h> // u8_mbsnlen from libunistring
    
    void main3() {
        size_t strlen = 0;
        char *str = file_to_buf("file.txt", &strlen);
        assert(str);
        // for simplicity I am assuming uint8_t is equal to unsigned char
        size_t count = u8_mbsnlen((const uint8_t *)str, strlen);
        printf("Number of characters: %zu\n", count);
        free(str);
    }
    
    int main() {
        main1();
        main2();
        main3();
    }
    
