UTF-8 strings in a MySQL database got messed up after configuration change


I have a MySQL with strings that I left dormant for a while. Now that I picked it up again, I noticed that all the special characters are screwed up. My ISP has ported the serve

2 Answers
  •  攒了一身酷
    2021-02-10 18:33

    C3 83 C6 92 C3 82 C2 AA
    

    This looks very much like UTF-8, so if we decode it, we get

    C3 3F C2 AA
    

    That's what you get if you treat the sequence of bytes as UTF-8, then encode it as ISO-8859-1. 3F is ?, which has been substituted as a replacement character because UTF-8 C6 92 is U+0192 ƒ, which does not exist in ISO-8859-1. It does exist in Windows code page 1252 (Western European), an encoding very similar to ISO-8859-1, where it is byte 0x83. Encoding to cp1252 instead gives:

    C3 83 C2 AA
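
The difference between the two encodings for this one character can be checked directly. The snippet below is only an illustrative sketch using Python 3's built-in codecs; the variable names are made up for the example.

```python
florin = "\u0192"  # U+0192, LATIN SMALL LETTER F WITH HOOK

# ISO-8859-1 has no code for this character, so a strict encode fails...
try:
    florin.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# ...and a lenient encode substitutes "?" (byte 0x3F).
latin1_replaced = florin.encode("iso-8859-1", errors="replace")

# cp1252 assigns the character to byte 0x83.
cp1252_byte = florin.encode("cp1252")

print(latin1_ok, latin1_replaced, cp1252_byte)
# prints: False b'?' b'\x83'
```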
    

    Go through another round of treat-as-UTF-8-bytes-and-encode-to-cp1252 and you get:

    C3 AA
    

    which is, finally, UTF-8 for ê.
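
The whole chain can be replayed in a few lines of Python 3. This is a sketch for checking the reasoning on the sample bytes, not a repair tool; the helper `hexdump` and the variable names are made up for the example.

```python
def hexdump(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


# The bytes found in the database, as quoted at the top of this answer.
mangled = bytes.fromhex("C3 83 C6 92 C3 82 C2 AA")

# Round 1: read the bytes as UTF-8, then write the characters as cp1252.
round1 = mangled.decode("utf-8").encode("cp1252")

# Round 2: the same step again.
round2 = round1.decode("utf-8").encode("cp1252")

# Read as UTF-8 one last time, the result is the intended character.
fixed = round2.decode("utf-8")
assert fixed == "\u00ea"  # U+00EA, e with circumflex

print(hexdump(round1))  # C3 83 C2 AA
print(hexdump(round2))  # C3 AA
```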

    Note that even if you serve a non-XML HTML page explicitly as ISO-8859-1, browsers will actually use the cp1252 encoding, due to nasty historical reasons.

    MySQL's latin1 is, despite the name, documented in the MySQL manual as matching Windows cp1252 rather than strict ISO-8859-1, so dumping as latin1 then reloading as utf8 (twice) may work in principle. It is usually easier to control and verify the conversion outside the database: process the dump script with a text editor that can save in either encoding, or e.g. in Python with open(path, 'rb').read().decode('utf-8').encode('cp1252').decode('utf-8').encode('cp1252').
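
For a whole dump file, the same two rounds could be wrapped in a small Python 3 script. This is only a sketch. The function names `undo_mojibake` and `repair_dump` are made up for illustration, and it assumes every non-ASCII string in the file was mangled exactly twice in the way described above. Error handling is left strict on purpose, so data that does not fit that assumption raises an exception rather than being silently damaged.

```python
def undo_mojibake(data: bytes, rounds: int = 2) -> bytes:
    """Apply the decode-as-UTF-8, encode-as-cp1252 step `rounds` times."""
    for _ in range(rounds):
        # Strict mode: raises a UnicodeError subclass if the bytes are not
        # valid UTF-8 or a character has no cp1252 equivalent.
        data = data.decode("utf-8").encode("cp1252")
    return data


def repair_dump(src_path: str, dst_path: str, rounds: int = 2) -> None:
    """Read a dump file as raw bytes, repair it, and write a new file."""
    with open(src_path, "rb") as src:
        data = src.read()
    repaired = undo_mojibake(data, rounds)
    # Write to a separate file so the original dump is left untouched.
    with open(dst_path, "wb") as dst:
        dst.write(repaired)
```

Calling `repair_dump("dump.sql", "dump_fixed.sql")` (hypothetical file names) would produce a file meant to be loaded as UTF-8. Trying it on a copy first and spot-checking a few known strings is a sensible precaution, since rows written after the configuration change may not be mangled the same way.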
