Question
There are many questions about UTF-8 to Unicode conversion, but I still haven't found an answer to my issue.
Consider a string like this:
a = "Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"
Python 3.6 interprets this string as Je-li pro za\xc5\x99azov\xc3\xa1n\xc3\xad, with literal backslashes. I need to convert this UTF-8-like string to its Unicode representation. The final result should be Je-li pro zařazování.
With a.decode("utf-8") I get AttributeError: 'str' object has no attribute 'decode', because Python considers the object already decoded.
If I convert it to bytes first with bytes(a, "utf-8"), the backslashes are only doubled, and .decode("utf-8") gives me back my original a.
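The two failed attempts described above can be reproduced with a short sketch. The variable names error_message, as_bytes and round_trip are illustrative only and are not part of the original question.

```python
a = "Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"

# In Python 3, str has no .decode() method, so this raises AttributeError.
try:
    a.decode("utf-8")
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)

# Encoding to bytes keeps every backslash and hex digit as a separate
# ASCII character; the repr of the bytes object shows each backslash doubled.
as_bytes = bytes(a, "utf-8")
print(as_bytes)

# Decoding those bytes therefore just returns the original text.
round_trip = as_bytes.decode("utf-8")
print(round_trip == a)
```

The escape sequences are never interpreted in this round trip, which is why the text comes back unchanged.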
How can I get the Unicode string Je-li pro zařazování from this a?
Answer 1:
You have to chain four encode/decode steps to get the desired result:
print(
    "Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"
    # any encoding that covers printable ASCII works here, for example utf-8
    .encode('ascii')
    # unescape the string
    # source: https://stackoverflow.com/a/1885197
    .decode('unicode-escape')
    # the alias 'latin-1' also works, see https://stackoverflow.com/q/7048745
    .encode('iso-8859-1')
    # finally, interpret the raw bytes as UTF-8
    .decode('utf-8')
)
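To make the four steps easier to follow, here is a minimal sketch that names each intermediate value. The function name unescape_utf8 is hypothetical and chosen only for illustration. It assumes the input is pure ASCII and that every escape denotes a single byte (\xNN); other inputs may raise UnicodeEncodeError or UnicodeDecodeError.

```python
def unescape_utf8(escaped: str) -> str:
    """Turn text with literal \\xNN escapes into the UTF-8 text they spell out.

    Hypothetical helper; assumes ASCII-only input with byte-sized escapes.
    """
    # 1. str -> bytes: the backslashes and hex digits are plain ASCII.
    as_ascii = escaped.encode("ascii")
    # 2. bytes -> str: interpret the escapes; each \xNN becomes the
    #    code point U+00NN (not yet the intended character).
    unescaped = as_ascii.decode("unicode-escape")
    # 3. str -> bytes: ISO-8859-1 maps U+0000..U+00FF one-to-one onto
    #    the bytes 0x00..0xFF, recovering the raw UTF-8 byte sequence.
    raw = unescaped.encode("iso-8859-1")
    # 4. bytes -> str: decode the raw bytes as UTF-8.
    return raw.decode("utf-8")


result = unescape_utf8("Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad")
print(result)
```

Step 3 is the one that tends to surprise readers: it works only because ISO-8859-1 assigns each of the first 256 code points to the byte with the same value.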
Besides, if you can, consider having the data source emit a different output format (a byte array or base64-encoded text, for example).
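As a rough illustration of the base64 suggestion, the sketch below assumes you control both the producer and the consumer. The names text, payload and decoded are illustrative, and the producer side is hypothetical since the original data source is not described.

```python
import base64

text = "Je-li pro za\u0159azov\u00e1n\u00ed"

# Producer side (hypothetical): ship the UTF-8 bytes as ASCII-only base64 text.
payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
print(payload)

# Consumer side: one base64 decode plus one UTF-8 decode, no unescaping needed.
decoded = base64.b64decode(payload).decode("utf-8")
print(decoded)
```

Because the payload contains no backslash escapes, the consumer never has to guess how they should be interpreted.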
The shorter but unsafe way (eval executes arbitrary code, so never use it on untrusted input):
st = "Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"
print(eval("b'"+st+"'").decode('utf-8'))
There is also ast.literal_eval, but it may not be worth using here.
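For comparison, a minimal sketch of the ast.literal_eval variant follows. It builds the same bytes literal as the eval version but accepts only Python literals, so it cannot run arbitrary code. It assumes the string contains no single quotes or line breaks; such input would make the constructed literal invalid and raise an exception.

```python
import ast

st = "Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"

# Wrap the text in b'...' so the parser interprets each \xNN as one byte.
raw = ast.literal_eval("b'" + st + "'")

# The resulting bytes object holds the real UTF-8 byte sequence.
result = raw.decode("utf-8")
print(result)
```

This is safer than eval but still depends on the input being a well-formed literal body, which is one reason the four-step encode/decode chain is the more robust choice.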
Source: https://stackoverflow.com/questions/49756071/python-3-6-utf-8-to-unicode-conversion-string-with-double-backslashes