How to compare multibyte characters in C

感动是毒 2021-01-17 19:11

I am trying to parse text and find certain characters in it. I use the code below. It works with plain ASCII characters like abcdef, but it does not work with öçşğüı.

4 answers
  • 2021-01-17 20:02

    To step through the characters of the string one at a time, you can use mblen. You also need to set a locale whose encoding matches that of the multibyte string, so that mblen can parse it correctly.

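    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <locale.h>
    
    int main(void)
    {
        const char *text = "öçşğü";
        const char *needle = "ö";
        int i = 0, char_len;
    
        /* The locale name is platform-specific; "en_US.utf8" is typical on Linux */
        if (setlocale(LC_CTYPE, "en_US.utf8") == NULL)
        {
            fprintf(stderr, "UTF-8 locale not available\n");
            return 1;
        }
    
        while ((char_len = mblen(&text[i], MB_CUR_MAX)) > 0)
        {
            /* &text[i] points at a multibyte character of length char_len */
            if ((size_t)char_len == strlen(needle) &&
                memcmp(&text[i], needle, char_len) == 0)
            {
                printf("ö \n");
            }
    
            i += char_len;
        }
    
        return 0;
    }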
    

    There are two ways to represent such a string: as a multibyte string (a sequence of 8-bit bytes) or as a wide-character string (the element size depends on the platform). The multibyte form has the advantage that it fits in a char * (an ordinary C string, as in your code), but the disadvantage that a single character may occupy several bytes. A wide string is represented as wchar_t *, with the advantage that one wchar_t is one character (however, as @anatolyg pointed out, this assumption can still fail on platforms where wchar_t cannot represent all possible characters).
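
To connect the two representations, here is a minimal sketch that converts a multibyte string to a wide string with the standard mbstowcs function. The helper name to_wide is hypothetical, and the conversion depends on the current LC_CTYPE locale, so non-ASCII input only converts correctly after a matching setlocale call.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert the multibyte string mb into out, which has
   room for out_len wide characters including the terminator.
   Returns the number of wide characters written (terminator excluded),
   or (size_t)-1 if out_len is 0 or mb contains an invalid sequence for
   the current LC_CTYPE locale. */
size_t to_wide(const char *mb, wchar_t *out, size_t out_len)
{
    size_t n;

    if (out_len == 0)
        return (size_t)-1;

    /* Leave one slot free so the result can always be terminated */
    n = mbstowcs(out, mb, out_len - 1);
    if (n == (size_t)-1)
        return n;

    out[n] = L'\0';
    return n;
}
```

After a successful conversion each element of out can be compared against a wide character constant directly, with no length bookkeeping.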

    Have you looked at your source code in a hex editor? The string "öçşğü" is actually stored in memory as the byte sequence c3 b6 c3 a7 c5 9f c4 9f c3 bc (UTF-8 encoding), followed by a zero terminator. You see 5 characters only because your UTF-8-aware viewer/browser renders the bytes that way. Consequently strlen(text) returns 10 for this string, whereas the loop above iterates only 5 times.
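
The difference between bytes and characters can be checked without any locale, assuming the input is valid UTF-8: every character contributes exactly one byte that is not a continuation byte (bit pattern 10xxxxxx). The function name utf8_count below is made up for this sketch.

```c
#include <stddef.h>

/* Count the characters (code points) in a UTF-8 string by counting every
   byte that is NOT a continuation byte (10xxxxxx).
   Assumes valid UTF-8; malformed input is not detected. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the byte sequence c3 b6 c3 a7 c5 9f c4 9f c3 bc this returns 5, while strlen returns 10.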

    If you use a wide-character string, it can be done as explained by @WillBriggs.

  • 2021-01-17 20:03

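    See the Wikipedia article: https://en.wikipedia.org/wiki/UTF-8. In particular, it has a table of the bit patterns.

    Here's another way to scan a UTF-8 string and split it into one slot per character [not exact, just an example; refer to the wiki]. Note that it stores the raw UTF-8 bytes of each character together rather than computing the numeric code point: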

    // utf8scan -- convert utf8 to codepoints (example)
    
    char inpbuf[1000];
    char uni[8];
    
    typedef union {
        char utf8[4];
        unsigned int code;
    } codepoint_t;
    
    codepoint_t outbuf[1000];
    
    // unidecode -- decode utf8 char into codepoint
    // RETURNS: updated rhs pointer
    char *
    unidecode(codepoint_t *lhs,char *rhs)
    {
        int idx;
        int chr;
    
        idx = 0;
        lhs->utf8[idx++] = *rhs++;
    
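        for (;  ;  ++rhs, ++idx) {
            chr = *rhs;
    
            // end of string
            if (chr == 0)
                break;
    
            // start of new ascii char
            if ((chr & 0x80) == 0)
                break;
    
            // start of new unicode char
            if (chr & 0x40)
                break;
    
            // continuation byte: store it, dropping any extras in malformed
            // input so we never write past utf8[3]
            if (idx < (int) sizeof(lhs->utf8))
                lhs->utf8[idx] = chr;
        }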
    
        return rhs;
    }
    
    // main -- main program
    int
    main(void)
    {
        char *rhs;
        codepoint_t *lhs;
    
        rhs = inpbuf;
        lhs = outbuf;
    
        for (;  *rhs != 0;  ++lhs) {
            lhs->code = 0;
    
            // ascii char
            if ((*rhs & 0x80) == 0)
                lhs->utf8[0] = *rhs++;
    
            // get/skip unicode char
            else
                rhs = unidecode(lhs,rhs);
        }
    
        // add EOS
        lhs->code = 0;
    
        return 0;
    }
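
For comparison, here is a minimal sketch of decoding one UTF-8 sequence into its numeric code point, following the bit patterns in the Wikipedia table. The name utf8_decode and its return convention are assumptions made for this example; it checks the lead and continuation bytes but does not reject overlong encodings or surrogates.

```c
#include <stdint.h>

/* Decode the UTF-8 sequence starting at s and store its code point in *cp.
   Returns the number of bytes consumed (1 to 4), or 0 if the lead byte or
   one of the continuation bytes is malformed. */
int utf8_decode(const char *s, uint32_t *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    int len, i;

    if (p[0] < 0x80)                { *cp = p[0];        len = 1; }  /* 0xxxxxxx */
    else if ((p[0] & 0xE0) == 0xC0) { *cp = p[0] & 0x1F; len = 2; }  /* 110xxxxx */
    else if ((p[0] & 0xF0) == 0xE0) { *cp = p[0] & 0x0F; len = 3; }  /* 1110xxxx */
    else if ((p[0] & 0xF8) == 0xF0) { *cp = p[0] & 0x07; len = 4; }  /* 11110xxx */
    else return 0;  /* stray continuation byte or invalid lead byte */

    for (i = 1; i < len; ++i) {
        /* Each following byte must look like 10xxxxxx; this also stops
           at a premature terminating zero byte */
        if ((p[i] & 0xC0) != 0x80)
            return 0;
        *cp = (*cp << 6) | (p[i] & 0x3F);
    }
    return len;
}
```

Decoding the bytes c3 b6 yields length 2 and code point 0xF6 (ö), which can then be compared with a plain integer test.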
    
  • 2021-01-17 20:04

    The best way to handle wide characters is as, well, wide characters.

    wchar_t myWord[] = L"Something";
    

    This will do it:

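    #include <stdio.h>
    #include <wchar.h>
    #include <locale.h>
    
    int main(void)
    {
        const wchar_t *text = L"öçşğü";
        int i = 0;
    
        /* Use the environment's locale so wprintf can encode non-ASCII output */
        setlocale(LC_ALL, "");
    
        while (text[i])
        {
            if (text[i] == L'ö')
            {
                wprintf(L"ö \n");
            }
    
            i++;
        }
    
        return 0;
    }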
    

    If you're in Visual Studio, like me, recall that the console window doesn't handle Unicode well. You can redirect the output to a file and examine the file to see the ö.
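
Because each element of a wide string is compared as a whole character, the search can be wrapped in a small helper that needs no locale at all. This is only a sketch; the name count_wchar is hypothetical, and the test string is written with universal character names so it does not depend on the source file encoding.

```c
#include <stddef.h>
#include <wchar.h>

/* Count how many times target occurs in the wide string text */
size_t count_wchar(const wchar_t *text, wchar_t target)
{
    size_t n = 0;

    for (; *text != L'\0'; ++text) {
        if (*text == target)
            ++n;
    }
    return n;
}
```

For example, count_wchar(L"\u00f6\u00e7\u015f\u011f\u00fc", L'\u00f6') returns 1, because ö appears once in the string.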

  • 2021-01-17 20:08

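    The C standard does not specify how non-ASCII characters embedded directly in a source file are encoded; that is implementation-defined, so the result depends on your compiler and its settings.

    Instead, the C standard (C99 and later, including C11) lets you write universal character names, which specify Unicode code points: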

    wchar_t text[] = L"\u00f6\u00e7\u015f\u0131\u011f";
    
    // Print whole string (%ls is the standard conversion for a wchar_t string)
    wprintf(L"%ls\n", text);
    
    // Test individual characters
    for (size_t i = 0; text[i]; ++i)
    {
        if ( text[i] == L'\u00f6' )
        {
            // whatever...
        }
    }
    

    If you are on Windows, you face an extra problem: the Windows console can't print Unicode characters by default. You need to do the following:

    • Change the console to use a TrueType monospaced font which includes glyphs for the characters you are trying to print. (I used "DejaVu Sans Mono" for this example)
    • In the source code, call the function _setmode(1, _O_WTEXT); , which will need #include <fcntl.h> (for _O_WTEXT) and #include <io.h> (for _setmode).

    To restore normal text mode afterwards, call _setmode(1, _O_TEXT);.

    Of course, if you are outputting to a file or to a Win32 API function then you don't need to do those steps.
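
The mode switch described above can be isolated in one place so the rest of the program stays portable. This is a rough sketch, not a tested Windows recipe: the helper name set_stdout_wide is hypothetical, the _setmode calls are compiled only when _WIN32 is defined, and on other platforms the function does nothing.

```c
#include <stdio.h>
#ifdef _WIN32
#include <fcntl.h>  /* _O_WTEXT, _O_TEXT */
#include <io.h>     /* _setmode */
#endif

/* Hypothetical helper: switch stdout to wide (UTF-16) text mode when
   enable is nonzero, or back to normal text mode when it is zero.
   Returns 0 on success and -1 on failure. On non-Windows platforms
   this is a no-op that always returns 0. */
int set_stdout_wide(int enable)
{
#ifdef _WIN32
    /* Flush pending output before changing the translation mode */
    fflush(stdout);
    if (_setmode(_fileno(stdout), enable ? _O_WTEXT : _O_TEXT) == -1)
        return -1;
#else
    (void)enable;  /* nothing to switch outside Windows */
#endif
    return 0;
}
```

A program would call set_stdout_wide(1) before its wprintf calls and set_stdout_wide(0) before returning to narrow output.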
