I'm trying to parse text and find certain characters in it. I use the code below. It works with plain ASCII characters like abcdef, but it does not work with öçşğüı.
To go through each of the characters in the string, you can use mblen. You also need to set the correct locale (the encoding used by the multibyte string), so that mblen can parse the multibyte string correctly.
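#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <locale.h>
int main(void)
{
    const char *text = "öçşğü";
    const char *target = "ö";
    size_t target_len = strlen(target);
    int i = 0, char_len;
    /* Locale names are platform-specific; "en_US.utf8" is typical on Linux */
    if (setlocale(LC_CTYPE, "en_US.utf8") == NULL)
    {
        fprintf(stderr, "could not set a UTF-8 locale\n");
        return 1;
    }
    /* mblen returns 0 at the terminating null and -1 on an invalid sequence */
    while ((char_len = mblen(&text[i], MB_CUR_MAX)) > 0)
    {
        /* &text[i] contains multibyte character of length char_len */
        if ((size_t)char_len == target_len && memcmp(&text[i], target, target_len) == 0)
        {
            printf("ö \n");
        }
        i += char_len;
    }
    return 0;
}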
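If all you need is to find a known character and you are sure the text is UTF-8, a locale-free alternative to the mblen loop is a plain byte search. UTF-8 is self-synchronizing, so the bytes of one encoded character never match in the middle of another character in valid text, and strstr is safe to use. The sketch below is only an illustration: the helper name count_utf8_char is made up, and the character is written as explicit bytes so the result does not depend on how the source file is encoded.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count non-overlapping occurrences of `needle`
 * (the UTF-8 bytes of one character) inside the UTF-8 string `text`.
 * Assumes both strings are valid, null-terminated UTF-8. */
size_t count_utf8_char(const char *text, const char *needle)
{
    size_t needle_len = strlen(needle);
    size_t count = 0;

    if (needle_len == 0)
        return 0;   /* an empty needle would match everywhere */

    for (const char *p = strstr(text, needle); p != NULL;
         p = strstr(p + needle_len, needle))
        ++count;

    return count;
}
```

For the string in the question, count_utf8_char(text, "\xc3\xb6") gives 1, because "\xc3\xb6" is the UTF-8 encoding of ö.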
There are two kinds of string representation: multibyte (8-bit bytes) and wide-character (element size depends on the platform). The multibyte representation has the advantage that it fits in a char * (the usual C string, as in your code), but the disadvantage that several bytes may represent one character. A wide string is represented using wchar_t *. wchar_t has the advantage that one wchar_t is one character (however, as @anatolyg pointed out, this assumption can still go wrong on platforms where wchar_t is not able to represent all possible characters).
Have you looked at your source code using a hex editor? The string "öçşğü" is actually stored in memory as the multibyte sequence c3 b6 c3 a7 c5 9f c4 9f c3 bc (UTF-8 encoding), followed by the terminating zero. You see 5 characters only because the string is rendered correctly by your UTF-8 aware viewer/browser. Accordingly, strlen(text) returns 10 for this string, whereas the loop above runs only 5 times.
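To see the difference between bytes and characters without involving the locale at all, you can count every byte that is not a UTF-8 continuation byte (bit pattern 10xxxxxx). This is a minimal sketch that assumes the input is valid UTF-8; the function name utf8_count_code_points is made up for this example.

```c
#include <stddef.h>

/* Hypothetical helper: count the code points in a valid UTF-8 string.
 * Every character starts with exactly one byte that is NOT of the form
 * 10xxxxxx, so counting those bytes counts the characters. */
size_t utf8_count_code_points(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the ten bytes c3 b6 c3 a7 c5 9f c4 9f c3 bc this returns 5, while strlen on the same bytes returns 10.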
If you use a wide-character string, it can be done as explained by @WillBriggs.
See wiki here: https://en.wikipedia.org/wiki/UTF-8 In particular, there is a table with the bit patterns.
Here's another way to scan/convert a UTF-8 string into codepoints [not exact, just an example--refer to wiki]:
// utf8scan -- convert utf8 to codepoints (example)
// NOTE: each outbuf entry holds the raw utf8 bytes of one character,
// not the decoded Unicode value
char inpbuf[1000];              // assumed to be filled with utf8 text first
char uni[8];
typedef union {
    char utf8[4];
    unsigned int code;
} codepoint_t;
codepoint_t outbuf[1000];
// unidecode -- decode utf8 char into codepoint
// RETURNS: updated rhs pointer
char *
unidecode(codepoint_t *lhs,char *rhs)
{
    int idx;
    int chr;
    idx = 0;
    // copy the lead byte
    lhs->utf8[idx++] = *rhs++;
    // copy continuation bytes (10xxxxxx), never more than utf8[] can hold
    for (; idx < (int) sizeof(lhs->utf8); ++rhs, ++idx) {
        chr = *rhs;
        // end of string
        if (chr == 0)
            break;
        // start of new ascii char
        if ((chr & 0x80) == 0)
            break;
        // start of new unicode char
        if (chr & 0x40)
            break;
        lhs->utf8[idx] = chr;
    }
    return rhs;
}
// main -- main program
int
main(void)
{
    char *rhs;
    codepoint_t *lhs;
    rhs = inpbuf;
    lhs = outbuf;
    for (; *rhs != 0; ++lhs) {
        lhs->code = 0;
        // ascii char
        if ((*rhs & 0x80) == 0)
            lhs->utf8[0] = *rhs++;
        // get/skip unicode char
        else
            rhs = unidecode(lhs,rhs);
    }
    // add EOS
    lhs->code = 0;
    return 0;
}
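The union-based example above groups the raw bytes of each character but does not compute the Unicode value. If you want the actual code point, the bit patterns from the wiki table can be applied roughly as follows. This is a sketch, not a full validator: the name utf8_decode is made up, and it does not reject overlong encodings or surrogate values.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode the UTF-8 sequence starting at `s` into *cp.
 * Returns the number of bytes consumed (1..4), or 0 at the end of the
 * string or on a malformed sequence (in which case *cp is not written). */
size_t utf8_decode(const char *s, uint32_t *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len;
    uint32_t value;

    if (p[0] == 0)
        return 0;                               /* end of string */

    if (p[0] < 0x80)                { len = 1; value = p[0]; }        /* 0xxxxxxx */
    else if ((p[0] & 0xE0) == 0xC0) { len = 2; value = p[0] & 0x1F; } /* 110xxxxx */
    else if ((p[0] & 0xF0) == 0xE0) { len = 3; value = p[0] & 0x0F; } /* 1110xxxx */
    else if ((p[0] & 0xF8) == 0xF0) { len = 4; value = p[0] & 0x07; } /* 11110xxx */
    else
        return 0;                               /* stray continuation or invalid lead byte */

    for (size_t i = 1; i < len; ++i) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;                           /* not 10xxxxxx (also catches an early null) */
        value = (value << 6) | (p[i] & 0x3F);   /* append the 6 payload bits */
    }

    *cp = value;
    return len;
}
```

Called on the bytes c3 b6 it consumes 2 bytes and stores 0xF6 (ö); called on c5 9f it stores 0x15F (ş). Advancing the input pointer by the return value walks the string one character at a time.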
The best way to handle wide characters is as, well, wide characters.
wchar_t myWord[] = L"Something";
This will do it:
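#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main(void)
{
    const wchar_t * text = L"öçşğü";
    int i = 0;
    /* Use the environment's locale so wprintf can convert non-ASCII output */
    setlocale(LC_ALL, "");
    while (text[i])
    {
        if (text[i] == L'ö')
        {
            wprintf(L"ö \n");
        }
        i++;
    }
    return 0;
}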
If you're in Visual Studio, like me, recall that the console window doesn't handle Unicode well. You can redirect the output to a file and examine the file to see the ö.
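The C standard does not say how non-ASCII characters embedded directly in your source file are interpreted; that is left to the compiler.
Instead, you can write them portably as universal character names (\uXXXX escapes, part of the standard since C99), which name Unicode code points: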
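wchar_t text[] = L"\u00f6\u00e7\u015f\u0131\u011f";
// Print whole string (%ls is the standard conversion for a wchar_t string)
wprintf(L"%ls\n", text);
// Test individual characters
for (size_t i = 0; text[i]; ++i)
{
    // compare against a wide character constant (L prefix), not u'...'
    if ( text[i] == L'\u00f6' )
    {
        // whatever...
    }
}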
If you are on Windows, you face an extra problem: the Windows console can't print Unicode characters by default. You need to do the following:
_setmode(1, _O_WTEXT);
This needs #include <fcntl.h> (for _O_WTEXT) and #include <io.h> (for _setmode). To restore normal text mode afterwards, you can call _setmode(1, _O_TEXT);
Of course, if you are outputting to a file or to a Win32 API function then you don't need to do those steps.