Strange UTF8 string comparison

难免孤独 2021-01-14 00:43

I'm having a problem with UTF-8 string comparison that I really can't figure out, and it's starting to give me a headache. Please help me out.
Basically I have this strin

3 Answers
  • 2021-01-14 01:07

    This seems somewhat relevant. To simplify, there are several ways to write the same text in Unicode (and therefore in UTF-8). For example, ř can be written as one character, ř (U+0159), or as two characters: r followed by the combining caron (U+030C).

    Your best bet is PHP's Normalizer class (part of the intl extension): normalize both strings to the same normalization form and compare the results.
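
    As a minimal sketch of that idea (assuming PHP 7 or later with the intl extension loaded; the variable names are only illustrative), the two spellings of ř differ byte for byte but compare equal once both are normalized to NFC:

    ```php
    <?php
    // Requires the intl extension, which provides the Normalizer class.

    $precomposed = "\u{0159}";   // ř as a single code point, U+0159
    $decomposed  = "r\u{030C}";  // r followed by U+030C COMBINING CARON

    // The byte sequences differ, so a plain comparison fails.
    var_dump($precomposed === $decomposed); // bool(false)

    // After normalizing both strings to NFC they compare equal.
    $a = Normalizer::normalize($precomposed, Normalizer::FORM_C);
    $b = Normalizer::normalize($decomposed, Normalizer::FORM_C);
    var_dump($a === $b); // bool(true)
    ```

    NFC is used here only as one reasonable choice; any form works as long as both strings are normalized to the same one.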

    In one of the comments, you show these hex representations of the strings:

    4d696e61205469646967617265 20   616e7374 c3a4   6c6c6e696e676172  // from XML
    4d696e61205469646967617265 c2a0 616e7374 61cc88 6c6c6e696e676172 // typed
            ^^-----------------^^^^1         ^^^^^^2
    

    Note the parts I marked; apparently there are two parts to this problem.

    • For the first, see this question on the meaning of the byte sequence "c2a0": for some reason, what you typed contains a non-breaking space (U+00A0) where the XML file has a normal space (U+0020). Note that there is a normal space in both cases after "Mina". I'm not sure what to do about that in PHP, other than replacing all whitespace with a normal space.

    • As to the second, this is the case I outlined above: c3a4 is ä (U+00E4 "LATIN SMALL LETTER A WITH DIAERESIS", one character, two bytes), whereas 61 is a (U+0061 "LATIN SMALL LETTER A", one character, one byte) and cc88 is the combining diaeresis (U+0308 "COMBINING DIAERESIS", one character, two bytes), so the typed form takes two characters and three bytes in total. Here, the normalization library should be useful.
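
    Putting both parts together, here is a rough sketch using the two byte sequences from the hex dump. It assumes PHP 7 or later with the intl extension; canonicalize() is a hypothetical helper name, and it only handles the non-breaking space rather than every kind of whitespace:

    ```php
    <?php
    // Requires the intl extension for Normalizer.

    // The two byte sequences from the hex dump above.
    $fromXml = hex2bin('4d696e61205469646967617265' . '20' . '616e7374' . 'c3a4' . '6c6c6e696e676172');
    $typed   = hex2bin('4d696e61205469646967617265' . 'c2a0' . '616e7374' . '61cc88' . '6c6c6e696e676172');

    // canonicalize() is a hypothetical helper, not a library function.
    function canonicalize(string $s): string
    {
        // Part 1: turn U+00A0 NO-BREAK SPACE into an ordinary space.
        $s = str_replace("\u{00A0}", ' ', $s);
        // Part 2: compose "a" + U+0308 into the single code point U+00E4.
        return Normalizer::normalize($s, Normalizer::FORM_C);
    }

    var_dump($fromXml === $typed);                             // bool(false)
    var_dump(canonicalize($fromXml) === canonicalize($typed)); // bool(true)
    ```

    Applying the same helper to both sides before comparing keeps the comparison symmetric, whichever source the string came from.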

  • 2021-01-14 01:17

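    // Re-encode only when $s is not already valid UTF-8; like utf8_encode(), this assumes ISO-8859-1 input.
    if (!mb_check_encoding($s, 'UTF-8')) { $s = mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1'); }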

  • 2021-01-14 01:20

    A blind guess: maybe the two UTF-8 strings do not have the same underlying representation (an accented character can be encoded either as a base letter plus a combining mark or as a single precomposed character). You should give us a hex dump of both UTF-8 strings so that someone can help.
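
    One simple way to produce such a dump in PHP is bin2hex(). The string below is only an illustrative value (PHP 7 or later for the \u{...} escape), not the asker's actual data:

    ```php
    <?php
    // Illustrative input: "anställningar" with a precomposed ä (U+00E4).
    $s = "anst\u{00E4}llningar";

    // bin2hex() prints the raw bytes, which makes hidden differences visible.
    echo bin2hex($s), "\n"; // 616e7374c3a46c6c6e696e676172
    ```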
