I'm programming something that counts the number of UTF-8 characters in a file. I've already written the base code, but now I'm stuck in the part where the characters are su…
In C, as in C++, there is no ready-made solution for counting UTF-8 characters. You can convert the UTF-8 string to a wide-character string using mbstowcs and count with wcslen, but this is not the best choice for performance (especially if you only need to count the characters and nothing else).
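For illustration, a minimal sketch of that approach (count_with_mbstowcs is my own name; on POSIX systems mbstowcs accepts a NULL destination to just count the wide characters):

#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

// Sketch of the mbstowcs/wcslen approach; the locale name is an assumption,
// and real code should handle the (size_t)-1 error return properly.
size_t count_with_mbstowcs(const char *utf8) {
    setlocale(LC_ALL, "en_US.UTF-8");   // a UTF-8 locale is required
    size_t n = mbstowcs(NULL, utf8, 0); // first pass: number of wide characters
    if (n == (size_t)-1) return 0;      // invalid multibyte sequence
    wchar_t *wbuf = malloc((n + 1) * sizeof *wbuf);
    if (!wbuf) return 0;
    mbstowcs(wbuf, utf8, n + 1);        // actual conversion
    size_t len = wcslen(wbuf);          // same value as n here
    free(wbuf);
    return len;
}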
I think a good answer to your question is here: counting unicode characters in c++.
Example from the linked answer:
for (; *p != 0; ++p)
    count += ((*p & 0xc0) != 0x80); // count every byte that is not a continuation byte
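As a self-contained sketch of the same idea (the wrapper name utf8_count is mine):

#include <stddef.h>

// Counts UTF-8 characters by skipping continuation bytes (those of the form 10xxxxxx).
size_t utf8_count(const char *p) {
    size_t count = 0;
    for (; *p != 0; ++p)
        count += ((*p & 0xc0) != 0x80);
    return count;
}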
There are multiple options you may take:

- Read the file as wide characters with fgetwc (see main1 below)
- Convert the multibyte string to a wide string with mbsrtowcs (see main2 below)
- Use u8_mbsnlen (see main3 below), which uses libunistring
- Write a utf8_strlen-ish solution that relies on a specific property of UTF-8 strings and check the bytes yourself, as shown in other answers.

Here is an example program that has to be compiled with -lunistring under Linux, with rudimentary error checking via assert:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <assert.h>
#include <stdlib.h>

void main1()
{
    // read the file as wide characters
    const char *l = setlocale(LC_ALL, "en_US.UTF-8");
    assert(l);
    FILE *file = fopen("file.txt", "r");
    assert(file);
    int count = 0;
    while (fgetwc(file) != WEOF) {
        count++;
    }
    fclose(file);
    printf("Number of characters: %i\n", count);
}

// just a helper function because I'm lazy
char *file_to_buf(const char *filename, size_t *strlen) {
    FILE *file = fopen(filename, "r");
    assert(file);
    size_t n = 0;
    char *ret = malloc(1);
    assert(ret);
    for (int c; (c = fgetc(file)) != EOF;) {
        ret = realloc(ret, n + 2);
        assert(ret);
        ret[n++] = c;
    }
    ret[n] = '\0';
    *strlen = n;
    fclose(file);
    return ret;
}

void main2() {
    const char *l = setlocale(LC_ALL, "en_US.UTF-8");
    assert(l);
    size_t strlen = 0;
    char *str = file_to_buf("file.txt", &strlen);
    assert(str);
    // convert multibyte string to wide string
    // assuming multibytes are in UTF-8
    // this may also be done in a streaming fashion when reading byte by byte from a file
    // and calling `mbtowc` and checking errno for EILSEQ and managing some buffer
    mbstate_t ps = {0};
    const char *tmp = str;
    size_t count = mbsrtowcs(NULL, &tmp, 0, &ps);
    assert(count != (size_t)-1);
    printf("Number of characters: %zu\n", count);
    free(str);
}

#include <unistr.h> // u8_mbsnlen from libunistring

void main3() {
    size_t strlen = 0;
    char *str = file_to_buf("file.txt", &strlen);
    assert(str);
    // for simplicity I am assuming uint8_t is equal to unsigned char
    size_t count = u8_mbsnlen((const uint8_t *)str, strlen);
    printf("Number of characters: %zu\n", count);
    free(str);
}

int main() {
    main1();
    main2();
    main3();
}
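To build and run it (assuming the source is saved as count.c and libunistring's headers and library are installed):

cc count.c -lunistring -o count && ./count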
You could look into the specs: https://tools.ietf.org/html/rfc3629.
Chapter 3 has this table in it:
Char. number range | UTF-8 octet sequence
(hexadecimal) | (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
You could inspect the bytes and build the Unicode characters from them.
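For instance, the leading byte alone tells you how long each sequence is; here is a small sketch based directly on that table (utf8_seq_len is a hypothetical name, not from the RFC):

// Classify a leading byte per the RFC 3629 table above.
// Returns the expected sequence length in bytes, or 0 for a
// continuation byte (10xxxxxx) or an invalid value.
static int utf8_seq_len(unsigned char b) {
    if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
    if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                         // continuation or invalid byte
}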
A different question is whether you would count a base character and its accent (a combining mark, cf. https://en.wikipedia.org/wiki/Combining_character) as one character or as several.
See: https://en.wikipedia.org/wiki/UTF-8#Encoding
Each UTF-8 sequence contains one starting byte and zero or more extra bytes. Extra bytes always start with the bits 10, and the first byte never starts with that pattern. You can use that information to count only the first byte of each UTF-8 sequence:
if ((b & 0xC0) != 0x80) { // b is not a continuation byte, so it starts a new character
    count++;
}
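Put together, a minimal sketch that reads a file byte by byte (the file name and variable names are mine):

#include <stdio.h>

int main(void) {
    FILE *f = fopen("file.txt", "rb"); // file name is an assumption
    if (!f) return 1;
    long count = 0;
    for (int b; (b = fgetc(f)) != EOF;)
        if ((b & 0xC0) != 0x80) // skip continuation bytes
            count++;
    fclose(f);
    printf("Number of characters: %ld\n", count);
    return 0;
}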
Keep in mind this will break if the file contains invalid UTF-8 sequences. Also, "UTF-8 characters" might mean different things. For example, "é" may be one code point or two (an e followed by a combining accent), even though it reads as a single character.