Question
I have a UTF-8 character in Chinese or Arabic. I need to get the value of that UTF-8 character, just as I would get the value of an ASCII character. I need to implement this in C. Can you please provide your suggestions?
For example:
char array[3] = "ab";
int v1,v2;
v1 = array[0];
v2 = array[1];
In the above code I get the corresponding ASCII values in v1 and v2. In the same way, for a UTF-8 string I need to get the value of each character in the string.
Answer 1:
C11 was the first version of the C standard to add Unicode support (u8 string literals and the <uchar.h> header), so depending on which standard you are targeting, you can use the C11 features in <uchar.h> or rely on a Unicode library such as ICU.
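As a rough sketch of the <uchar.h> route: mbrtoc32() decodes from the current locale's multibyte encoding, so it only turns UTF-8 bytes into code points when the locale's encoding is UTF-8. The helper names set_utf8_locale and decode_with_mbrtoc32, and the locale names "C.UTF-8" and "en_US.UTF-8", are assumptions made for this illustration; which locales exist depends on the system.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: tries two commonly available UTF-8 locale names.
 * The names are assumptions; availability depends on the system.
 * Returns 1 if a locale was selected, 0 otherwise. */
int set_utf8_locale(void)
{
    return setlocale(LC_CTYPE, "C.UTF-8") != NULL
        || setlocale(LC_CTYPE, "en_US.UTF-8") != NULL;
}

/* Hypothetical helper: decodes the NUL-terminated multibyte string str
 * (in the current locale's encoding) into out[], which has room for cap
 * elements. Returns the number of elements written, or -1 if the input
 * is invalid, incomplete, or does not fit. */
long decode_with_mbrtoc32(const char *str, char32_t *out, size_t cap)
{
    assert(str != NULL && out != NULL);

    mbstate_t state;
    memset(&state, 0, sizeof state);   /* initial conversion state */
    size_t len = strlen(str), pos = 0, count = 0;

    while (pos < len) {
        char32_t c;
        size_t used = mbrtoc32(&c, str + pos, len - pos, &state);
        if (used == (size_t)-1 || used == (size_t)-2)
            return -1;                 /* invalid or incomplete sequence */
        if (count == cap)
            return -1;                 /* output buffer too small */
        out[count++] = c;
        if (used != (size_t)-3)        /* -3 means no input was consumed */
            pos += used;
    }
    return (long)count;
}
```

The intended use is to call set_utf8_locale() once at startup and then decode_with_mbrtoc32(s, buf, n). The elements of buf are Unicode code points on implementations that define __STDC_UTF_32__.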
Answer 2:
There is no such thing as a UTF-8 character. There are Unicode characters and there are encodings for Unicode characters such as UTF-8.
What you probably want is to decode several bytes - encoded in UTF-8 and representing a single Unicode character - into the Unicode code point.
There is a lot of C source code for this available on the net. Just search for "UTF-8 decoding C".
Update:
What you're obviously looking for is UTF-8 decoding for more than just one character, namely a function that decodes an array of bytes (UTF-8 encoded text) into an array of ints (Unicode code points).
The answer remains the same: use Google. There is a lot of C code for it out there.
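To make the decoding step concrete, here is a minimal sketch of a hand-written decoder in C. The function names utf8_decode_one and utf8_to_code_points are made up for this illustration and do not come from any library. It assumes the input is a NUL-terminated UTF-8 string, and it rejects malformed input (overlong forms, surrogates, values above U+10FFFF) rather than substituting a replacement character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: decodes one code point from the bytes s[0..len).
 * Stores it in *cp and returns the number of bytes consumed (1 to 4),
 * or 0 if the sequence is malformed or truncated. */
size_t utf8_decode_one(const unsigned char *s, size_t len, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;

    unsigned char b0 = s[0];
    size_t need;     /* number of continuation bytes that must follow */
    uint32_t value;  /* code point accumulated so far */
    uint32_t min;    /* smallest value that legitimately needs this length */

    if (b0 < 0x80) { *cp = b0; return 1; }  /* plain ASCII */
    else if ((b0 & 0xE0) == 0xC0) { need = 1; value = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { need = 2; value = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { need = 3; value = b0 & 0x07; min = 0x10000; }
    else return 0;   /* stray continuation byte or invalid lead byte */

    if (len < need + 1)
        return 0;    /* sequence is cut off */

    for (size_t i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                        /* not a continuation byte */
        value = (value << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong encodings, UTF-16 surrogates and out-of-range values. */
    if (value < min || value > 0x10FFFF || (value >= 0xD800 && value <= 0xDFFF))
        return 0;

    *cp = value;
    return need + 1;
}

/* Hypothetical helper: decodes a NUL-terminated UTF-8 string into out[],
 * which has room for cap code points. Returns the number of code points
 * written, or -1 on malformed input or if out[] is too small. */
long utf8_to_code_points(const char *str, uint32_t *out, size_t cap)
{
    assert(str != NULL && out != NULL);
    const unsigned char *s = (const unsigned char *)str;
    size_t len = strlen(str), pos = 0, count = 0;

    while (pos < len) {
        uint32_t cp;
        size_t used = utf8_decode_one(s + pos, len - pos, &cp);
        if (used == 0 || count == cap)
            return -1;
        out[count++] = cp;
        pos += used;
    }
    return (long)count;
}
```

For the UTF-8 bytes of "日本語" this fills the output array with 0x65E5, 0x672C and 0x8A9E, which is the per-character value the question asks for.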
Answer 3:
The C and C++ model is that the encoding is tied to the locale, so code using that model works with the encoding of the locale, whatever it is.
If your locale uses UTF-8 as its narrow (multibyte) encoding, see mbtowc(), mbrtowc(), mbstowcs() and mbsrtowcs(); they should be pretty straightforward to use.
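A small sketch of the locale-based approach using mbstowcs() might look like the following. The function name decode_in_locale is invented for this example, and the locale name passed in (for instance "C.UTF-8" or "en_US.UTF-8") is an assumption about what the system provides. Note that wchar_t holds a full code point on most Unix-like systems but is only 16 bits wide on Windows, so characters above U+FFFF are represented differently there.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: selects locale_name for character handling, then
 * converts the NUL-terminated multibyte string str into at most cap wide
 * characters in out[].
 * Returns the number of wide characters written (not counting any
 * terminator), -1 if str is not valid in that locale's encoding, or -2 if
 * the locale is not available.
 * setlocale() changes process-wide state; real code would normally call it
 * once at startup rather than inside a conversion helper. */
long decode_in_locale(const char *locale_name, const char *str,
                      wchar_t *out, size_t cap)
{
    assert(locale_name != NULL && str != NULL && out != NULL);

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -2;                      /* locale not installed */

    size_t n = mbstowcs(out, str, cap); /* converts at most cap characters */
    if (n == (size_t)-1)
        return -1;                      /* invalid multibyte sequence */
    return (long)n;
}
```

A call such as decode_in_locale("C.UTF-8", s, buf, n) would then leave one wide character per input character in buf, provided that locale exists on the machine.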
Answer 4:
With ICU, you can step through UTF-8 characters with U8_NEXT:
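#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <unicode/utf.h>    /* ICU UTF macros, including U8_NEXT */

int main(void)
{
    const char s[] = "日本語";   /* the source file must be saved as UTF-8 */
    UChar32 c;                   /* decoded code point */
    int32_t k;                   /* byte offset into s */
    int32_t len = (int32_t)strlen(s);

    for (k = 0; k < len;) {
        /* Decode the character starting at offset k and advance k past it.
           c is set to a negative value if the bytes are not valid UTF-8. */
        U8_NEXT(s, k, len, c);
        printf("%d - %x\n", (int)k, (unsigned)c);
    }
    return 0;
}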
To compile: gcc utf.c -o utf $(icu-config --ldflags --ldflags-icuio)
After each U8_NEXT call, the index k is the byte offset at which the encoding of the next character starts, and c contains the Unicode value (32 bits) of the character that was just decoded.
Source: https://stackoverflow.com/questions/14056230/how-to-get-the-value-of-utf-8-character