How to compare multibyte characters in C

感动是毒 2021-01-17 19:11

I am trying to parse text and find certain characters in it. I use the code below. It works with plain characters like abcdef, but it does not work with öçşğüı.

4 answers
  •  滥情空心
    2021-01-17 20:02

    To step through the string one character at a time, you can use mblen. You also need to set a locale whose encoding matches the multibyte string, so that mblen can parse it correctly.

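    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    int main(void)
    {
        const char *text = "öçşğü";
        const char *target = "ö";
        size_t target_len = strlen(target);
        size_t i = 0;
        int char_len;
    
        /* mblen decodes according to LC_CTYPE, so select a UTF-8 locale first */
        if (setlocale(LC_CTYPE, "en_US.utf8") == NULL)
        {
            fprintf(stderr, "UTF-8 locale not available\n");
            return 1;
        }
    
        /* mblen returns 0 at the terminating zero and -1 on an invalid sequence */
        while ((char_len = mblen(&text[i], MB_CUR_MAX)) > 0)
        {
            /* &text[i] points at a multibyte character of length char_len;
               compare lengths first so memcmp never reads past target */
            if ((size_t)char_len == target_len &&
                memcmp(&text[i], target, target_len) == 0)
            {
                printf("ö\n");
            }
    
            i += (size_t)char_len;
        }
    
        return 0;
    }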
    

    There are two kinds of string representation: multibyte (sequences of 8-bit bytes) and wide (element size depends on the platform). A multibyte string has the advantage that it can be stored in a char * (an ordinary C string, as in your code), but the disadvantage that a single character may occupy several bytes. A wide string is stored in a wchar_t *, and has the advantage that one wchar_t is one character. (However, as @anatolyg pointed out, this assumption can still fail on platforms where wchar_t cannot represent every possible character, for example where wchar_t is only 16 bits wide.)

    Have you looked at your source code using a hex editor? The string "öçşğü" is actually stored in memory as the byte sequence c3 b6 c3 a7 c5 9f c4 9f c3 bc (UTF-8 encoding), followed by the terminating zero. You see 5 characters only because your UTF-8-aware viewer or browser renders those bytes correctly. Accordingly, strlen(text) returns 10 for this string, whereas the loop above iterates only 5 times.

    If you use a wide string instead, it can be done as explained by @WillBriggs.
