Python String Comparison--Problems With Special/Unicode Characters

太阳男子 2021-02-04 20:57

I'm writing a Python script to process some music data. It's supposed to merge two separate databases by comparing their entries and matching them up. It's almost working, bu

4 answers
  •  日久生厌
    2021-02-04 21:39

    Unicode vs Bytes

    First, some terminology. There are two types of strings, encoded and decoded:

    • Encoded. This is what's stored on disk. To Python, it's a bunch of 0's and 1's that you might treat like ASCII, but it could be anything -- binary data, a JPEG image, whatever. In Python 2.x, this is called a "string" variable. In Python 3.x, it's more accurately called a "bytes" variable.
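    • Decoded. This is a string of actual characters. It could later be encoded as 8-bit ASCII bytes, or as a multi-byte encoding such as UTF-8 or UTF-32 that can also represent Chinese characters. But until it's time to convert it to an encoded variable, it's just a Unicode string of characters. In Python 2.x, this is called a "unicode" variable. In Python 3.x, it's a "str" variable.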
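To make the distinction concrete, here is a minimal sketch, assuming Python 3 (where decoded text is str and encoded data is bytes). The sample value is made up for illustration; it shows that the same characters become different bytes under different encodings, and that decoding with the wrong encoding gives different text.

```python
# Assumes Python 3: str is decoded text, bytes is encoded data.
text = "Beyonc\u00e9"  # hypothetical sample: "Beyoncé", written with an escape

# The same characters become different byte sequences per encoding.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")
print(utf8_bytes)    # b'Beyonc\xc3\xa9'
print(latin1_bytes)  # b'Beyonc\xe9'

# Decoding with the matching encoding recovers the original characters.
round_trip = utf8_bytes.decode("utf-8")
print(round_trip == text)  # True

# Decoding with the wrong encoding succeeds here but yields different text.
wrong_guess = utf8_bytes.decode("latin-1")
print(wrong_guess == text)  # False
```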

    What this means to you

    So here's the thing. You said you were getting one ASCII variable and one Unicode variable. That's actually not true.

    • You have one variable that's a string of bytes -- ones and zeros, presumably in sets of 8. This is the variable you assumed, incorrectly, to be ASCII.
    • You have another variable that's Unicode data -- numbers, letters, and symbols.

    Before you compare the string of bytes to a Unicode string of characters, you have to make some assumptions. In your case, Python (and you) assumed that the string of bytes was ASCII encoded. That worked fine until you came across a character that wasn't ASCII -- a character with an accent mark.
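The failure described above can be reproduced in a few lines. This is a sketch assuming Python 3, where comparing bytes to str simply returns False (Python 2 instead tried an implicit ASCII decode). The values are hypothetical.

```python
# Assumes Python 3. Both values are hypothetical sample data.
raw = b"Caf\xe9"      # bytes that happen to be Latin-1 encoded
text = "Caf\u00e9"    # the decoded string "Café"

# bytes and str never compare equal in Python 3, whatever they contain.
mixed_equal = (raw == text)
print(mixed_equal)  # False

# Assuming ASCII fails as soon as a non-ASCII byte shows up.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)  # False

# Decoding with the encoding the bytes were actually written in works.
decoded_equal = (raw.decode("latin-1") == text)
print(decoded_equal)  # True
```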

    So you need to find out what that string of bytes is encoded as. It might be latin1. If it is, you want to do this:

    if unicode_variable == string_variable.decode('latin1'):
        pass  # the two entries match; handle the match here
    

    Latin1 is basically ASCII plus some extended characters like Ç and Â.

    If your data is in Latin1, that's all you need to do. But if your string of bytes is encoded in something else, you'll need to figure out what encoding that is and pass it to decode().

    The bottom line is, there's no easy answer, unless you know (or make some assumptions) about the encoding of your input data.

    What I would do

    Try running var.decode('latin1') on your string of bytes. That will give you a Unicode variable. If the data looks correct (i.e., characters with accent marks look like they belong), roll with it.

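    Oh, and if the latin1 result doesn't look right, try utf8 -- another common encoding. Note that decoding as latin1 never raises an error, because every byte value maps to some character, so garbled output is the only sign of a wrong guess.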
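One way to automate the trial and error is sketched below, assuming Python 3. The helper name decode_guess and its default encoding list are hypothetical, not part of any library. It tries UTF-8 first because UTF-8 decoding rejects most byte sequences that were not written as UTF-8, whereas Latin-1 decoding accepts any bytes and so only makes sense as a last resort.

```python
def decode_guess(raw, encodings=("utf-8", "latin-1")):
    """Return raw decoded with the first encoding that accepts it.

    Hypothetical helper: the order matters, since latin-1 never fails.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this guess was wrong, try the next encoding
    raise ValueError("none of the candidate encodings could decode the data")


# Valid UTF-8 input is decoded as UTF-8.
from_utf8 = decode_guess(b"Caf\xc3\xa9")
# A lone 0xE9 byte is not valid UTF-8, so the helper falls back to Latin-1.
from_latin1 = decode_guess(b"Caf\xe9")
print(from_utf8 == from_latin1)  # True
```

A successful decode only means the bytes were valid in that encoding, not that the guess was right, so it is still worth checking that the accented characters look sensible.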
