Question
I'm programming something that counts the number of UTF-8 characters in a file. I've already written the base code, but now I'm stuck on the part where the characters are supposed to be counted. So far, this is what I have:
What's inside the text file:
黄埔炒蛋
你好
こんにちは
여보세요
What I've coded so far:
#include <stdio.h>

typedef unsigned char BYTE;

int main(int argc, char const *argv[])
{
    FILE *file = fopen("file.txt", "r");
    if (!file)
    {
        printf("Could not open file.\n");
        return 1;
    }
    int count = 0;
    while (1)
    {
        BYTE b;
        fread(&b, 1, 1, file);
        if (feof(file))
        {
            break;
        }
        count++;
    }
    printf("Number of characters: %i\n", count);
    fclose(file);
    return 0;
}
My question is: how would I code the part where the UTF-8 characters are counted? I tried to look for inspiration on GitHub and YouTube, but I haven't found anything that works well with my code yet.
Edit: As written, this code prints that the text file has 48 characters. But counting UTF-8 characters, it should only be 18.
Answer 1:
See: https://en.wikipedia.org/wiki/UTF-8#Encoding
Each UTF-8 sequence consists of one starting byte and zero or more continuation bytes. Continuation bytes always start with the bits 10, and a starting byte never does. You can use that fact to count only the first byte of each UTF-8 sequence:
if ((b & 0xC0) != 0x80) {
    count++;
}
Keep in mind this will break if the file contains invalid UTF-8 sequences. Also, "UTF-8 characters" can mean different things; for example, "👩🏿" will be counted as two characters by this method.
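For reference, here is one way the check could be dropped into the loop from the question. This is only a sketch: it counts lead bytes and does not validate the sequences, and it opens the file in binary mode since it inspects raw bytes.

#include <stdio.h>

typedef unsigned char BYTE;

int main(void)
{
    FILE *file = fopen("file.txt", "rb"); /* binary mode: we look at raw bytes */
    if (!file)
    {
        printf("Could not open file.\n");
        return 1;
    }

    int count = 0;
    BYTE b;
    while (fread(&b, 1, 1, file) == 1)
    {
        /* count every byte that is NOT a continuation byte (10xxxxxx) */
        if ((b & 0xC0) != 0x80)
        {
            count++;
        }
    }

    printf("Number of characters: %i\n", count);
    fclose(file);
    return 0;
}

With the sample file above this should print 18, since the three newlines are counted as characters too.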
Answer 2:
In C, as in C++, there is no ready-made solution for counting UTF-8 characters. You can convert the UTF-8 string to a wide-character string with mbstowcs and use the wcslen function, but this is not the best approach for performance (especially if you only need to count the characters and nothing else).
I think a good answer to your question is here: counting unicode characters in c++.
Example from the answer at the link:
for (; *p != 0; ++p)
    count += ((*p & 0xc0) != 0x80);
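For completeness, here is a minimal sketch of the mbstowcs-plus-wcslen route described above. count_chars_locale is just an illustrative name, and it assumes a UTF-8 locale has already been set with setlocale.

#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Convert a UTF-8 (multibyte) string to a wide string and measure it.
   Returns (size_t)-1 on allocation failure or on an invalid sequence.
   Requires setlocale(LC_ALL, "en_US.UTF-8") (or similar) beforehand. */
size_t count_chars_locale(const char *utf8)
{
    size_t nbytes = strlen(utf8);
    wchar_t *wide = malloc((nbytes + 1) * sizeof *wide); /* worst case: 1 wchar per byte */
    if (!wide)
        return (size_t)-1;

    size_t count = mbstowcs(wide, utf8, nbytes + 1); /* (size_t)-1 on invalid input */
    if (count != (size_t)-1)
        count = wcslen(wide); /* same value; shown to match the description above */
    free(wide);
    return count;
}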
Answer 3:
You could look into the specs: https://tools.ietf.org/html/rfc3629.
Section 3 contains this table:
Char. number range | UTF-8 octet sequence
(hexadecimal) | (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
You could inspect the bytes and build up the Unicode characters yourself.
A different question is whether you would count a base character and its accent (a combining mark, cf. https://en.wikipedia.org/wiki/Combining_character) as one character or as several.
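As an illustration of the table, here is a small sketch that classifies a lead byte and returns the length of the sequence it starts. utf8_seq_len is an illustrative name, and it does not validate the continuation bytes that follow.

#include <stddef.h>

/* Returns the number of bytes in the UTF-8 sequence introduced by `lead`,
   or 0 if `lead` is a continuation byte or an invalid lead byte. */
static size_t utf8_seq_len(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx: U+0000  .. U+007F   */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx: U+0080  .. U+07FF   */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx: U+0800  .. U+FFFF   */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx: U+10000 .. U+10FFFF */
    return 0;                            /* 10xxxxxx or invalid           */
}

Counting characters then amounts to reading a lead byte, incrementing the count, and skipping the remaining seq_len - 1 continuation bytes (ideally after checking that each of them really matches 10xxxxxx).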
Answer 4:
There are multiple options you may take:
- You may depend on your system's implementation of wide and multibyte encodings:
  - read the file as a wide stream and just count the characters, letting the system do the UTF-8 multibyte-to-wide conversion on its own (see main1 below), or
  - read the file as bytes, convert the multibyte string into a wide string yourself, and count the wide characters (see main2 below).
- You may use an external library that operates on UTF-8 strings and counts the Unicode characters (see main3 below, which uses libunistring).
- Or roll your own utf8_strlen-ish solution that relies on a specific property of UTF-8 strings and checks the bytes yourself, as shown in the other answers (a sketch is included at the end of this answer).
Here is an example program, to be compiled with -lunistring under Linux, with rudimentary error checking via assert:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <assert.h>
#include <stdlib.h>

void main1()
{
    // read the file as wide characters
    const char *l = setlocale(LC_ALL, "en_US.UTF-8");
    assert(l);
    FILE *file = fopen("file.txt", "r");
    assert(file);
    int count = 0;
    while (fgetwc(file) != WEOF) {
        count++;
    }
    fclose(file);
    printf("Number of characters: %i\n", count);
}

// a small helper that reads the whole file into a malloc'd, NUL-terminated buffer
char *file_to_buf(const char *filename, size_t *strlen) {
    FILE *file = fopen(filename, "r");
    assert(file);
    size_t n = 0;
    char *ret = malloc(1);
    assert(ret);
    for (int c; (c = fgetc(file)) != EOF;) {
        ret = realloc(ret, n + 2);
        assert(ret);
        ret[n++] = c;
    }
    ret[n] = '\0';
    *strlen = n;
    fclose(file);
    return ret;
}

void main2() {
    const char *l = setlocale(LC_ALL, "en_US.UTF-8");
    assert(l);
    size_t strlen = 0;
    char *str = file_to_buf("file.txt", &strlen);
    assert(str);
    // convert the multibyte string to a wide string,
    // assuming the multibyte encoding is UTF-8
    // this may also be done in a streaming fashion when reading byte by byte from a file,
    // by calling `mbtowc`, checking errno for EILSEQ and managing some buffer
    mbstate_t ps = {0};
    const char *tmp = str;
    size_t count = mbsrtowcs(NULL, &tmp, 0, &ps);
    assert(count != (size_t)-1);
    printf("Number of characters: %zu\n", count);
    free(str);
}

#include <unistr.h> // u8_mbsnlen from libunistring
void main3() {
    size_t strlen = 0;
    char *str = file_to_buf("file.txt", &strlen);
    assert(str);
    // for simplicity I am assuming uint8_t is equal to unsigned char
    size_t count = u8_mbsnlen((const uint8_t *)str, strlen);
    printf("Number of characters: %zu\n", count);
    free(str);
}

int main() {
    main1();
    main2();
    main3();
}
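For the last option in the list above, a hand-rolled utf8_strlen could look like the following minimal sketch (illustrative name, no validation of the input); it could be called on the buffer returned by file_to_buf:

#include <stddef.h>

// Count UTF-8 code points in a NUL-terminated buffer by counting
// every byte that is not a continuation byte (10xxxxxx).
size_t utf8_strlen(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}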
Source: https://stackoverflow.com/questions/64846096/utf-8-character-count