“surrogateescape” cannot escape certain characters

你的背包 2021-01-17 18:30

On reading and writing text files in Python, one of the main Python contributors says this about the surrogateescape Unicode error handler:

3 Answers
  • 2021-01-17 18:44

    Why might the surrogateescape Unicode Error Handler be returning a character that is not ASCII?

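    Because that is exactly what it is designed to do: each undecodable byte is mapped to a lone surrogate code point (U+DC80 to U+DCFF). That way you can use the same error handler in the other direction and it will know how to turn those code points back into the original bytes.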

    >>> b"'Zo\xc3\xab\\'s'".decode('ascii', errors='surrogateescape')
    "'Zo\udcc3\udcab\\'s'"
    >>> "'Zo\udcc3\udcab\\'s'".encode('ascii', errors='surrogateescape')
    b"'Zo\xc3\xab\\'s'"
    
  • 2021-01-17 18:56

    A lone surrogate should NOT be encoded in UTF-8 -- which is precisely why it was used for the internal representation of invalid input.

    In real life, it is pretty common to get data that is invalid for the encoding it is "supposed" to be in. For example, this question was inspired by text that appears to be in Latin-1, when ASCII or UTF-8 was expected. I put "supposed" in quotes, because it is pretty common for the "encoding information" to just be a guess, perhaps unrelated to the actual file.

    By default, XML processing (and most Unicode processing) is strict -- a single invalid byte makes the entire process give up, even though it could process hundreds of other lines just fine.
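As a minimal sketch of that strict default: the sample line below is made up for illustration (one Latin-1 byte, `0xEB`, inside otherwise plain ASCII) and is not taken from the original data. A single undecodable byte is enough to make the whole decode call fail.

```python
# Hypothetical sample line: "Zoë's Coffee House" encoded as Latin-1.
raw = b"Zo\xeb's Coffee House"

try:
    raw.decode("ascii")  # errors="strict" is the default
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    # exc.start and exc.end delimit the offending input.
    bad_byte = raw[exc.start:exc.end]
```

The exception carries the position of the bad byte, but nothing from the rest of the line is returned.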

    Decoding with errors='replace' would turn that line into "Zo?'s Coffee House", which is an improvement. (Well, unless you tried to replace invalid characters with something else that isn't valid either -- and the official Unicode replacement character, U+FFFD, isn't valid in ASCII, which is why a '?' is typically used for encoding.)
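A small sketch of the two halves of that substitution, again assuming a made-up Latin-1 sample line: decoding inserts U+FFFD, and the '?' only appears once the text is encoded back to ASCII.

```python
# Hypothetical sample line: "Zoë's Coffee House" encoded as Latin-1.
raw = b"Zo\xeb's Coffee House"

# Decoding: the undecodable byte becomes U+FFFD, the replacement character.
decoded = raw.decode("ascii", errors="replace")

# Encoding: U+FFFD is not ASCII either, so the encoder substitutes '?'.
encoded = decoded.encode("ascii", errors="replace")
```

Either way the original byte is gone, which is the trade-off compared with surrogateescape.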

    surrogateescape is used when the programmer decides "You know what? I don't care if the data is garbage. Maybe I have the wrong codec ... so I'll just pass the unknown bytes along as-is." Python does have to store (but avoid interpreting) those bytes internally until they are passed along.
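One way to picture "pass the unknown bytes along as-is" is the sketch below. The function name `rename_house` and the sample data are hypothetical; the point is only that the text can be edited as a normal `str` while the unknown byte survives untouched.

```python
def rename_house(raw: bytes) -> bytes:
    """Edit the ASCII part of a line; pass unknown bytes through unchanged."""
    # Undecodable bytes are stored as lone surrogates instead of raising.
    text = raw.decode("ascii", errors="surrogateescape")
    text = text.replace("Coffee", "Tea")
    # The same handler turns the lone surrogates back into the original bytes.
    return text.encode("ascii", errors="surrogateescape")


result = rename_house(b"Zo\xeb's Coffee House")
```

The codec never has to be right about the byte `0xEB`; it only has to leave it alone.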

    Using unpaired surrogates allows Python to store the invalid bytes without extra escaping. Precisely because unpaired surrogates are invalid, they will never appear in valid input. (And if they occur anyhow, they'll be interpreted as unrecognized bytes, all of which get preserved for output.)
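The storage scheme described above can be checked directly. This is a sketch rather than a specification: it confirms that each byte from 0x80 to 0xFF maps to the code point U+DC00 plus the byte value, and that a surrogate already present in UTF-8 input (built here with the `surrogatepass` handler purely for the demonstration) is escaped byte by byte instead of being interpreted.

```python
# Every non-ASCII byte maps to one lone surrogate in U+DC80..U+DCFF.
raw = bytes(range(0x80, 0x100))
text = raw.decode("ascii", errors="surrogateescape")
mapped = all(ord(ch) == 0xDC00 + byte for ch, byte in zip(text, raw))
restored = text.encode("ascii", errors="surrogateescape")

# A surrogate smuggled into UTF-8 input is itself invalid UTF-8,
# so its bytes are escaped and still round-trip unchanged.
smuggled = "\udcc3".encode("utf-8", errors="surrogatepass")
escaped = smuggled.decode("utf-8", errors="surrogateescape")
round_trip = escaped.encode("utf-8", errors="surrogateescape")
```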

    The original poster's problem is that he was trying to print out that internal representation directly, instead of reversing the mapping first, and the internal representation contained lone surrogates that are (intentionally) not valid ... so the default (strict) error handler refused to encode them.
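A sketch of that failure and of reversing the mapping first. The sample bytes are made up, and the choice of Latin-1 for the final decode is an assumption about what the data really is, not something the handler can know.

```python
# Hypothetical input: a Latin-1 byte read with the wrong (ASCII) codec.
text = b"Zo\xeb's".decode("ascii", errors="surrogateescape")

# Printing to a UTF-8 terminal amounts to a strict encode, which refuses
# the lone surrogate.
try:
    text.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Reverse the mapping to get the original bytes back, then decode them
# with the codec that actually fits (assumed here to be Latin-1).
original_bytes = text.encode("ascii", errors="surrogateescape")
printable = original_bytes.decode("latin-1")
```

After the reversal, `printable` is an ordinary string with no surrogates in it and can be printed normally.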

  • 2021-01-17 18:57

    For what reason should a low surrogate such as U+DCC3 be encoded in UTF-8? This is not allowed, and it is useless, because a surrogate is NOT a character. Find the high surrogate that belongs to the low surrogate, decode the pair to its code point, and then create the proper UTF-8 sequence for that code point.
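For genuine UTF-16 surrogate pairs, the combination this answer describes can be sketched as follows. The helper name `combine_surrogates` is hypothetical. Note that this only applies when a matching high surrogate exists; the lone surrogates produced by surrogateescape have none.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Return the code point encoded by a UTF-16 surrogate pair."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate")
    # Each surrogate carries 10 bits of the code point above U+FFFF.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83D, 0xDE00)
utf8_bytes = chr(code_point).encode("utf-8")
```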
