How to get the value of a UTF-8 character

Question

I have a UTF-8 character in Chinese or Arabic. I need to get the value of that UTF-8 character, just as I would get the value of an ASCII character. I need to implement this in C. Can you please provide suggestions?

For example:

char array[3] = "ab";
int v1,v2;

v1 = array[0];
v2 = array[1];

In the above code I get the corresponding ASCII values in v1 and v2. In the same way, for a UTF-8 string I need to get the value of each character in the string.

Answer 1:

C11 is the first version of the C standard with built-in UTF-8 support (u8 string literals and the conversion functions in <uchar.h>). Depending on which standard you are targeting, you can use the C11 features or rely on a Unicode library such as ICU.
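As a rough illustration of the <uchar.h> route, the sketch below converts a multibyte string to char32_t values with mbrtoc32(). The helper name to_char32 and its signature are hypothetical, not part of the answer. It assumes the caller has already selected a UTF-8 locale (for example with setlocale(LC_ALL, "")), since mbrtoc32() decodes whatever encoding the current locale uses, and that the implementation defines __STDC_UTF_32__, so that char32_t values are Unicode code points.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper (not from the answer): converts the multibyte string s,
   in the encoding of the current locale, into at most cap char32_t values.
   Returns the number of values written, or (size_t)-1 on failure. */
size_t to_char32(const char *s, char32_t *out, size_t cap)
{
    assert(s != NULL && out != NULL);

    mbstate_t state;
    memset(&state, 0, sizeof state);          /* initial conversion state */

    size_t left = strlen(s);
    size_t n = 0;

    while (left > 0 && n < cap) {
        size_t used = mbrtoc32(&out[n], s, left, &state);

        /* -1: invalid sequence, -2: incomplete sequence.
           -3 is not expected for UTF-32 output; this sketch treats it as failure. */
        if (used == (size_t)-1 || used == (size_t)-2 || used == (size_t)-3)
            return (size_t)-1;
        if (used == 0)                        /* decoded a null character */
            break;

        s += used;                            /* skip the bytes just consumed */
        left -= used;
        n++;
    }
    return n;
}
```

With a UTF-8 locale active, each out[i] would then hold one code point, which is the "value" the question asks about. Note that <uchar.h> is not available on every platform.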

Answer 2:

There is no such thing as a UTF-8 character. There are Unicode characters and there are encodings for Unicode characters such as UTF-8.

What you probably want is to decode several bytes - encoded in UTF-8 and representing a single Unicode character - into the Unicode code point.
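To make "several bytes representing a single Unicode character" concrete: in UTF-8 the first (lead) byte of a sequence tells you how many bytes belong to the character. The sketch below shows only that classification step; the name utf8_sequence_length is hypothetical, and a complete decoder would additionally combine the payload bits and validate the continuation bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the answer): reports how many bytes the
   UTF-8 sequence starting at s occupies, judging by its lead byte alone.
   Returns 0 if s points at a continuation byte or an invalid lead byte. */
int utf8_sequence_length(const char *s)
{
    assert(s != NULL);
    unsigned char lead = (unsigned char)s[0];

    if (lead < 0x80)                  return 1; /* 0xxxxxxx: U+0000..U+007F (ASCII) */
    if (lead >= 0xC2 && lead <= 0xDF) return 2; /* 110xxxxx: U+0080..U+07FF         */
    if (lead >= 0xE0 && lead <= 0xEF) return 3; /* 1110xxxx: U+0800..U+FFFF         */
    if (lead >= 0xF0 && lead <= 0xF4) return 4; /* 11110xxx: U+10000..U+10FFFF      */
    return 0;   /* 10xxxxxx continuation byte, or 0xC0, 0xC1, 0xF5..0xFF */
}
```

This is why reading array[0] of a UTF-8 string yields a code point only for ASCII text: for Chinese or Arabic text it yields just the lead byte of a two- or three-byte sequence.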

There's a lot of C source code for this available on the net. Just google for UTF-8 decoding C.

Update:

What you're evidently looking for is UTF-8 decoding for more than just one character, namely a function that decodes an array of bytes (UTF-8 encoded text) into an array of ints (Unicode code points).
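A minimal sketch of such a function, written by hand without any library, might look like the following. The name utf8_decode, its signature, and the error convention (returning (size_t)-1) are illustrative choices, not taken from the answer. It rejects truncated sequences, stray continuation bytes, overlong encodings, surrogates, and values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical function (name and signature are illustrative): decodes len
   bytes of UTF-8 from src into out[0..cap). Returns the number of code points
   written, or (size_t)-1 if the input is malformed or out is too small. */
size_t utf8_decode(const char *src, size_t len, uint32_t *out, size_t cap)
{
    assert(src != NULL && out != NULL);

    const unsigned char *p = (const unsigned char *)src;
    size_t i = 0, n = 0;

    while (i < len) {
        uint32_t cp, min;
        size_t extra;                  /* continuation bytes still to read */
        unsigned char b = p[i++];

        /* The lead byte fixes the sequence length and carries the top bits. */
        if (b < 0x80)                { cp = b;        extra = 0; min = 0;       }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; min = 0x80;    }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; min = 0x800;   }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; min = 0x10000; }
        else return (size_t)-1;        /* stray continuation or invalid lead */

        if (extra > len - i)
            return (size_t)-1;         /* sequence is cut off by end of input */

        /* Each continuation byte (10xxxxxx) contributes six more bits. */
        for (; extra > 0; extra--) {
            unsigned char c = p[i++];
            if ((c & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (c & 0x3F);
        }

        /* Reject overlong forms, out-of-range values, and surrogates. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;

        if (n == cap)
            return (size_t)-1;         /* output array is full */
        out[n++] = cp;
    }
    return n;
}
```

For example, the nine bytes of "日本語" decode to the three code points U+65E5, U+672C and U+8A9E.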

The answer remains the same: use Google. There's a lot of C code for it out there.

Answer 3:

The C and C++ model is that the encoding is tied to the locale, so code using that model works with the encoding of the current locale, whatever it is.

If your locale uses UTF-8 as its narrow (multibyte) encoding, see mbtowc(), mbrtowc(), mbstowcs() and mbsrtowcs(); they should be pretty straightforward to use.
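As one possible way to use these functions, the sketch below calls mbsrtowcs() twice: once with a null destination to count the wide characters, and once to convert into a freshly allocated buffer. The helper name to_wide_alloc is hypothetical. It assumes the caller has already selected a UTF-8 locale with setlocale(), and that wchar_t is wide enough to hold any code point (true on Linux with glibc, where it is 32 bits; not true on Windows, where it is 16 bits).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not from the answer): converts the multibyte string s,
   in the encoding of the current locale, to a newly allocated wide string.
   Stores the character count in *count. Returns NULL on invalid input or
   allocation failure; the caller frees the result. */
wchar_t *to_wide_alloc(const char *s, size_t *count)
{
    assert(s != NULL && count != NULL);

    mbstate_t state;
    const char *p = s;

    /* First pass: with a null destination, mbsrtowcs only counts. */
    memset(&state, 0, sizeof state);
    size_t n = mbsrtowcs(NULL, &p, 0, &state);
    if (n == (size_t)-1)
        return NULL;                   /* invalid multibyte sequence */

    wchar_t *wide = malloc((n + 1) * sizeof *wide);
    if (wide == NULL)
        return NULL;

    /* Second pass: convert for real, including the terminating L'\0'. */
    p = s;
    memset(&state, 0, sizeof state);
    mbsrtowcs(wide, &p, n + 1, &state);

    *count = n;
    return wide;
}
```

Under those assumptions, each element of the returned array is the value of one character, so wide[0] plays the role that array[0] plays for ASCII text.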

Answer 4:

With ICU, you can step through UTF-8 characters with U8_NEXT:

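#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <unicode/utf.h>   /* pulls in unicode/utf8.h, which defines U8_NEXT */

int main(void)
{
    const char s[] = "日本語";   /* the source file must be saved as UTF-8 */

    UChar32 c;
    int32_t k;
    int32_t len = (int32_t)strlen(s);

    for (k = 0; k < len;) {
        /* Decode the code point that starts at s[k] into c and advance k
           past it; c is negative if the bytes are not valid UTF-8. */
        U8_NEXT(s, k, len, c);
        printf("%d - %x\n", (int)k, (unsigned)c);
    }

    return 0;
}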

To compile: gcc utf.c -o utf $(icu-config --ldflags --ldflags-icuio)

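After each U8_NEXT call, the index k has been advanced past the character just decoded, so the printed value is the byte offset at which the next character starts. c contains the Unicode code point (a 32-bit value) of the character that was decoded.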

Source: https://stackoverflow.com/questions/14056230/how-to-get-the-value-of-utf-8-character

Tags

utf-8