UTF-8 Character Count

Submitted by 。_饼干妹妹 on 2021-02-05 08:28:52

Question


I'm writing a program that counts the number of UTF-8 characters in a file. I've already written the base code, but I'm stuck on the part where the characters are supposed to be counted. This is what I have so far:

What's inside the text file:

黄埔炒蛋
你好
こんにちは
여보세요

What I've coded so far:

#include <stdio.h>

typedef unsigned char BYTE;

int main(int argc, char const *argv[])
{
    FILE *file = fopen("file.txt", "r");
    if (!file)
    {
        printf("Could not open file.\n");
        return 1;
    }
    int count = 0;

    while(1)
    {
        BYTE b;
        fread(&b, 1, 1, file);
        if (feof(file))
        {
            break;
        }
        count++;
    }
    printf("Number of characters: %i\n", count);

    fclose(file);

    return 0;
}

My question is: how would I code the part that counts the UTF-8 characters? I looked for inspiration on GitHub and YouTube, but I haven't found anything that works well with my code yet.

Edit: As written, this code reports that the text file has 48 characters, because it counts bytes. Counted as UTF-8 characters, it should only be 18.


Answer 1:


See: https://en.wikipedia.org/wiki/UTF-8#Encoding

Each UTF-8 sequence consists of one starting byte and zero or more continuation bytes. Continuation bytes always start with the bits 10, and a starting byte never does. You can use that to count only the first byte of each UTF-8 sequence:

    if((b&0xC0) != 0x80) {
        count++;
    }

Keep in mind that this breaks if the file contains invalid UTF-8 sequences. Also, "UTF-8 characters" can mean different things. For example, "👩🏿" is counted as two characters by this method, because it is made of two code points (a base emoji followed by a skin-tone modifier).
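
As a minimal sketch of how this check could slot into the question's read loop, the helper below counts every byte that is not a continuation byte. The name count_utf8_chars is hypothetical (it is not from the original answer), and the function assumes the stream holds valid UTF-8.

```c
#include <stdio.h>

/* Hypothetical helper: count UTF-8 sequences in a stream by counting
   every byte that is NOT a continuation byte (bit pattern 10xxxxxx).
   Assumes the input is valid UTF-8; invalid input gives a wrong count. */
long count_utf8_chars(FILE *file)
{
    long count = 0;
    int c;

    /* fgetc returns EOF at end of file or on a read error,
       so no separate feof check is needed inside the loop. */
    while ((c = fgetc(file)) != EOF) {
        if ((c & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

In the question's program, the while loop could be replaced by a call to count_utf8_chars(file), with the result printed using %ld. Note that newline characters are counted too, since each is a one-byte sequence.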




Answer 2:


In C, as in C++, there is no ready-made function for counting UTF-8 characters. You can convert the UTF-8 string to a wide string with mbstowcs and call wcslen on the result (wchar_t typically holds UTF-16 on Windows and UTF-32 on most Unix-like systems), but this is not the best option for performance, especially if you only need the number of characters and nothing else.

I think a good answer to your question is here: counting unicode characters in c++.

Example from the linked answer:

/* p points at a NUL-terminated UTF-8 string; count starts at 0 */
for (; *p != 0; ++p)
    count += ((*p & 0xc0) != 0x80);
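
The fragment above leaves p and count undeclared. One way to wrap it into a self-contained function for a NUL-terminated string is sketched below; the name utf8_strlen is an assumption for illustration, and the input is assumed to be valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical wrapper around the loop above: returns the number of
   UTF-8 sequences (code points) in a NUL-terminated string.
   Assumes valid UTF-8. */
size_t utf8_strlen(const char *s)
{
    size_t count = 0;

    /* Read bytes as unsigned so the bit mask behaves the same
       whether plain char is signed or unsigned. */
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; ++p) {
        count += ((*p & 0xC0) != 0x80);
    }
    return count;
}
```

Because the loop stops at the first zero byte, this variant only suits text that has already been read into a string and contains no embedded NUL bytes.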



Answer 3:


You could look into the specs: https://tools.ietf.org/html/rfc3629.

Section 3 contains this table:

   Char. number range  |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

You could inspect the bytes and build the Unicode characters from them according to this table.
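
A rough sketch of what that could look like follows. The function names utf8_decode and utf8_count_valid are hypothetical, and the checks are one reading of RFC 3629 (lead-byte patterns from the table, plus rejection of overlong forms, surrogates, and values above U+10FFFF) rather than a reference implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s[0..len). On success, store the code
   point in *cp and return the number of bytes consumed (1 to 4).
   Return 0 if the bytes do not form a valid sequence. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t n;      /* total length of the sequence */
    uint32_t c;    /* code point being assembled */
    uint32_t min;  /* smallest code point allowed for this length */

    if (len == 0) return 0;
    if (s[0] < 0x80)                { n = 1; c = s[0];        min = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;                  /* stray continuation or invalid lead byte */

    if (n > len) return 0;          /* truncated sequence */
    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* expected 10xxxxxx */
        c = (c << 6) | (s[i] & 0x3F);          /* append 6 payload bits */
    }
    /* Reject overlong forms, UTF-16 surrogates, and values past U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;

    *cp = c;
    return n;
}

/* Count code points in a buffer; returns (size_t)-1 on invalid input. */
size_t utf8_count_valid(const unsigned char *s, size_t len)
{
    size_t count = 0;
    while (len > 0) {
        uint32_t cp;
        size_t n = utf8_decode(s, len, &cp);
        if (n == 0) return (size_t)-1;
        s += n;
        len -= n;
        count++;
    }
    return count;
}
```

Compared with only skipping continuation bytes, this costs more work per character, but it reports invalid input instead of silently miscounting it.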

A separate question is whether you count a base character and its accent (a combining mark, cf. https://en.wikipedia.org/wiki/Combining_character) as one character or as several.
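
To illustrate the difference, the sketch below counts a base character together with its accents as one. It is a deliberately narrow approximation: it only recognizes the Combining Diacritical Marks block (U+0300 to U+036F, encoded in UTF-8 as CC 80 through CD AF), assumes valid UTF-8, and ignores combining marks in other blocks. Both function names are hypothetical.

```c
#include <stddef.h>

/* Return nonzero if the two bytes encode a code point in the
   Combining Diacritical Marks block, U+0300..U+036F
   (UTF-8 byte pairs CC 80 .. CD AF). */
int is_combining_diacritic(unsigned char b0, unsigned char b1)
{
    if (b0 == 0xCC) return b1 >= 0x80 && b1 <= 0xBF;  /* U+0300..U+033F */
    if (b0 == 0xCD) return b1 >= 0x80 && b1 <= 0xAF;  /* U+0340..U+036F */
    return 0;
}

/* Count code points in a valid UTF-8 buffer, skipping marks from the
   block above so that "e" + combining acute counts as one character. */
size_t utf8_count_base_chars(const unsigned char *s, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if ((s[i] & 0xC0) == 0x80) continue;   /* continuation byte */
        if (i + 1 < len && is_combining_diacritic(s[i], s[i + 1])) continue;
        count++;
    }
    return count;
}
```

With this, both the precomposed form of é (C3 A9) and the decomposed form (65 CC 81) count as one character. Counting user-perceived characters in general requires full grapheme cluster segmentation, which this sketch does not attempt.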




Answer 4:


There are several options:

  • You can rely on your system's implementation of wide and multibyte encodings:
    • read the file as a wide stream and count the wide characters, relying on the system to do the UTF-8 multibyte-to-wide conversion on its own (see main1 below), or
    • read the file as bytes, convert the multibyte string into a wide string, and count the wide characters (see main2 below).
  • You can use an external library that operates on UTF-8 strings and counts the Unicode characters (see main3 below, which uses libunistring).
  • Or you can roll your own utf8_strlen-style solution that relies on specific properties of UTF-8 and checks the bytes itself, as shown in the other answers.

Here is an example program with rudimentary error checking via assert. Under Linux it has to be linked with -lunistring:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <assert.h>
#include <stdlib.h>

void main1()
{
    // read the file as wide characters
    const char *l = setlocale(LC_ALL, "en_US.UTF-8");
    assert(l);
    FILE *file = fopen("file.txt", "r");
    assert(file);
    int count = 0;
    while(fgetwc(file) != WEOF) {
        count++;
    }
    fclose(file);
    printf("Number of characters: %i\n", count);
}

// helper: read the whole file into a NUL-terminated buffer
char *file_to_buf(const char *filename, size_t *strlen) {
    FILE *file = fopen(filename, "r");
    assert(file);
    size_t n = 0;
    char *ret = malloc(1);
    assert(ret);
    for (int c; (c = fgetc(file)) != EOF;) {
        ret = realloc(ret, n + 2);
        assert(ret);
        ret[n++] = c;
    }
    ret[n] = '\0';
    *strlen = n;
    fclose(file);
    return ret;
}

void main2() {
    const char *l = setlocale(LC_ALL, "en_US.UTF-8");
    assert(l);
    size_t strlen = 0;
    char *str = file_to_buf("file.txt", &strlen);
    assert(str);
    // convert the multibyte string to a wide string,
    // assuming the multibyte encoding is UTF-8
    // this may also be done in a streaming fashion, reading byte by byte from the file,
    // calling `mbtowc`, checking errno for EILSEQ, and managing a small buffer
    mbstate_t ps = {0};
    const char *tmp = str;
    size_t count = mbsrtowcs(NULL, &tmp, 0, &ps);
    assert(count != (size_t)-1);
    printf("Number of characters: %zu\n", count);
    free(str);
}

#include <unistr.h> // u8_mbsnlen from libunistring

void main3() {
    size_t strlen = 0;
    char *str = file_to_buf("file.txt", &strlen);
    assert(str);
    // for simplicity I am assuming uint8_t is the same type as unsigned char
    size_t count = u8_mbsnlen((const uint8_t *)str, strlen);
    printf("Number of characters: %zu\n", count);
    free(str);
}

int main() {
    main1();
    main2();
    main3();
}


Source: https://stackoverflow.com/questions/64846096/utf-8-character-count
