UTF-8 Character Count


I'm programming something that counts the number of UTF-8 characters in a file. I've already written the base code, but now I'm stuck in the part where the characters are su


4 Answers

不知归路

2021-01-23 10:32
In C, as in C++, there is no ready-made solution for counting UTF-8 characters. You can convert the UTF-8 text to a wide-character string with mbstowcs (wchar_t, whose encoding depends on the platform) and then call wcslen, but this is not the best option for performance, especially if you only need the number of characters and nothing else.

I think a good answer to your question is here: counting unicode characters in c++.

Example from the linked answer:
```
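// p points at a NUL-terminated UTF-8 string; count starts at 0
for (; *p != 0; ++p)
    count += ((*p & 0xc0) != 0x80); // skip continuation bytes (10xxxxxx)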
```
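For completeness, the loop above can be wrapped into a small function. This is only a minimal sketch: the name `utf8_strlen` is made up for illustration, and it assumes the input is a NUL-terminated string that is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated string.
 * Assumes the string is valid UTF-8; no validation is done.
 * Every byte that is not a continuation byte (10xxxxxx) starts a code point. */
size_t utf8_strlen(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    /* Use unsigned char so the bit test does not depend on char signedness. */
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; ++p)
        count += ((*p & 0xC0) != 0x80);
    return count;
}
```

For instance, `utf8_strlen("h\xC3\xA9llo")` looks at 6 bytes but counts 5 code points, because the byte `0xA9` is a continuation byte.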

礼貌的吻别

2021-01-23 10:38

There are several options you may take:

- You may depend on your system's implementation of wide and multibyte encodings:
  - you may read the file as a wide stream and just count the wide characters, relying on the system to do the UTF-8 multibyte-to-wide conversion on its own (see main1 below);
  - you may read the file as bytes, convert the multibyte string into a wide string, and count the wide characters (see main2 below).
- You may use an external library that operates on UTF-8 strings and counts the Unicode characters (see main3 below, which uses libunistring).
- Or you may roll your own utf8_strlen-ish solution that relies on a specific property of UTF-8 strings and checks the bytes yourself, as shown in other answers.

Here is an example program with rudimentary error checking via assert. On Linux it has to be linked with -lunistring:

```
#include <assert.h>
#include <locale.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <unistr.h> // u8_mbsnlen from libunistring

void main1(void)
{
    // read the file as wide characters; the C library decodes the bytes
    // according to the current locale
    const char *l = setlocale(LC_ALL, "en_US.UTF-8");
    assert(l);
    FILE *file = fopen("file.txt", "r");
    assert(file);
    size_t count = 0;
    while (fgetwc(file) != WEOF) {
        count++;
    }
    fclose(file);
    printf("Number of characters: %zu\n", count);
}

// just a helper function cause I'm lazy: reads the whole file into a
// NUL-terminated buffer and stores its length in bytes in *len
char *file_to_buf(const char *filename, size_t *len) {
    FILE *file = fopen(filename, "r");
    assert(file);
    size_t n = 0;
    char *ret = malloc(1);
    assert(ret);
    for (int c; (c = fgetc(file)) != EOF;) {
        ret = realloc(ret, n + 2); // room for this byte and the terminator
        assert(ret);
        ret[n++] = c;
    }
    ret[n] = '\0';
    *len = n;
    fclose(file);
    return ret;
}

void main2(void) {
    const char *l = setlocale(LC_ALL, "en_US.UTF-8");
    assert(l);
    size_t len = 0;
    char *str = file_to_buf("file.txt", &len);
    assert(str);
    // convert multibyte string to wide string
    // assuming multibytes are in UTF-8
    // this may also be done in a streaming fashion when reading byte by byte from a file
    // and calling with `mbtowc` and checking errno for EILSEQ and managing some buffer
    mbstate_t ps = {0};
    const char *tmp = str;
    // with a NULL destination, mbsrtowcs only returns the number of wide characters
    size_t count = mbsrtowcs(NULL, &tmp, 0, &ps);
    assert(count != (size_t)-1);
    printf("Number of characters: %zu\n", count);
    free(str);
}

void main3(void) {
    size_t len = 0;
    char *str = file_to_buf("file.txt", &len);
    assert(str);
    // for simplicity I am assuming uint8_t is equal to unsigned char
    size_t count = u8_mbsnlen((const uint8_t *)str, len);
    printf("Number of characters: %zu\n", count);
    free(str);
}

int main(void) {
    main1();
    main2();
    main3();
    return 0;
}
```
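The comments in main2 mention that the conversion may also be done in a streaming fashion. A rough sketch of that idea follows, using `mbrtowc` (the restartable variant, which keeps the partial-character state in an `mbstate_t`) and feeding it one byte at a time. The function names `use_utf8_locale` and `count_chars_streaming` are made up for illustration, and the locale names `C.UTF-8` and `en_US.UTF-8` are assumptions about what is installed on the system.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: try to select some UTF-8 locale.
 * The locale names are assumptions; returns 0 if neither is available. */
int use_utf8_locale(void)
{
    return setlocale(LC_ALL, "C.UTF-8") != NULL
        || setlocale(LC_ALL, "en_US.UTF-8") != NULL;
}

/* Hypothetical helper: count characters in buf[0..len) by handing mbrtowc
 * one byte at a time, as one would when reading from a file with fgetc.
 * Returns (size_t)-1 on an invalid or truncated sequence.
 * Assumes a UTF-8 locale has already been selected. */
size_t count_chars_streaming(const char *buf, size_t len)
{
    assert(buf != NULL);
    mbstate_t ps;
    memset(&ps, 0, sizeof ps);       /* initial conversion state */
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        size_t r = mbrtowc(NULL, &buf[i], 1, &ps);
        if (r == (size_t)-1)
            return (size_t)-1;       /* invalid sequence, errno is EILSEQ */
        if (r == (size_t)-2)
            continue;                /* character not complete yet */
        count++;                     /* this byte completed a character */
    }
    if (!mbsinit(&ps))
        return (size_t)-1;           /* input ended in the middle of a character */
    return count;
}
```

Compared with main2, this never needs the whole file in memory: the only state carried between bytes is the `mbstate_t`.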


臣服心动

2021-01-23 10:40
You could look into the specs: https://tools.ietf.org/html/rfc3629.

Section 3 has this table in it:
```
   Char. number range  |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
```
You could inspect the bytes and build the Unicode characters (code points) from them.
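As a rough illustration of building code points from the table, here is a sketch of a single-code-point decoder. The name `utf8_decode` and its signature are made up for this example; the extra checks for overlong forms, surrogates, and values above U+10FFFF follow the restrictions RFC 3629 places on valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from s[0..len).
 * Returns the number of bytes used (1 to 4), or 0 if the sequence is
 * truncated or not valid UTF-8. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;
    size_t n;        /* total sequence length, taken from the lead byte */
    uint32_t min;    /* smallest code point allowed for that length */
    if (s[0] < 0x80)                { *cp = s[0];        n = 1; min = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; n = 2; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; n = 3; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; n = 4; min = 0x10000; }
    else return 0;   /* stray continuation byte or invalid lead byte */
    if (len < n)
        return 0;    /* sequence is cut off */
    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                        /* expected 10xxxxxx */
        *cp = (*cp << 6) | (s[i] & 0x3F);    /* append 6 payload bits */
    }
    if (*cp < min || *cp > 0x10FFFF || (*cp >= 0xD800 && *cp <= 0xDFFF))
        return 0;    /* overlong form, out of range, or surrogate */
    return n;
}
```

Counting characters then amounts to calling this in a loop, advancing by the returned length, and stopping with an error when it returns 0.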

A separate question is whether you want to count a base character and its accent (a combining mark, cf. https://en.wikipedia.org/wiki/Combining_character) as one character or as several.
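To make the difference concrete, here is a small sketch that counts the same string both ways. Both function names are made up, the input is assumed to be valid NUL-terminated UTF-8, and only the Combining Diacritical Marks block (U+0300 to U+036F) is recognized; Unicode defines combining marks in other blocks as well, so this is an illustration rather than a complete solution.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count every code point (valid UTF-8 assumed). */
size_t count_code_points(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; ++p)
        count += ((*p & 0xC0) != 0x80);
    return count;
}

/* Hypothetical helper: same, but skip code points in U+0300..U+036F,
 * which are encoded as CC 80..CC BF and CD 80..CD AF. */
size_t count_without_combining(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; ++p) {
        if ((*p & 0xC0) == 0x80)
            continue;                 /* continuation byte */
        /* p[1] is safe to read: at worst it is the NUL terminator */
        if (*p == 0xCC && (p[1] & 0xC0) == 0x80)
            continue;                 /* U+0300..U+033F */
        if (*p == 0xCD && p[1] >= 0x80 && p[1] <= 0xAF)
            continue;                 /* U+0340..U+036F */
        count++;
    }
    return count;
}
```

With the decomposed string `"e\xCC\x81"` (e followed by U+0301), the first function counts 2 and the second counts 1, while the precomposed `"\xC3\xA9"` counts as 1 in both.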

情深已故

2021-01-23 10:47
See: https://en.wikipedia.org/wiki/UTF-8#Encoding

Each UTF-8 sequence consists of one leading byte followed by zero or more continuation bytes. Continuation bytes always start with the bits 10, and a leading byte never does. You can use that to count only the leading byte of each UTF-8 sequence.
```
    if((b&0xC0) != 0x80) {
        count++;
    }
```
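Since the question is about a file, the check above can be applied to each byte as it is read. This is a minimal sketch under the assumption that the stream contains valid UTF-8; the function name `count_utf8_in_stream` is made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: count UTF-8 code points readable from an open stream.
 * Assumes the content is valid UTF-8; no validation is done. */
long count_utf8_in_stream(FILE *fp)
{
    assert(fp != NULL);
    long count = 0;
    int b;                            /* int, so EOF is distinguishable from 0xFF */
    while ((b = fgetc(fp)) != EOF) {
        if ((b & 0xC0) != 0x80)       /* not a continuation byte */
            count++;
    }
    return count;
}
```

A caller would open the file with `fopen`, pass the `FILE *` to this function, and close it afterwards.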
Keep in mind that this will break if the file contains invalid UTF-8 sequences. Also, "UTF-8 characters" might mean different things. For example "