问题
I use python 2.7 and I'm receiving a string from a server (not in unicode!). Inside that string I find text with unicode escape sequences. For example like this:
<a href = "http://www.mypage.com/\u0441andmoretext">\u00b2<\a>
How do I convert those \uxxxx
- back to utf-8? The answers I found were either dealing with &#
or required eval()
which is too slow for my purposes. I need a universal solution for any text containing such sequenes.
Edit:
<\a>
is a typo but I want a tolerance against such typos as well. There should only be reaction to \u
The example text is meant in proper python syntax like this:
"<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"
The desired output is in proper python syntax
"<a href = \"http://www.mypage.com/\xd1\x81andmoretext\">\xc2\xb2<\\a>"
回答1:
Try
>>> s = "<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"
>>> s.decode("raw_unicode_escape")
u'<a href = "http://www.mypage.com/\u0441andmoretext">\xb2<\\a>'
And then you can encode to utf8 as usual.
回答2:
Python does contain some special string codecs for cases like this.
In this case, if there are no other characters outside the 32-127 range, you can safely decode your byte-string using the "unicode_escape" codec to have a proper Unicode text object in Python. (On which your program should be performing all textual operations) - Whenever you are outputting that text again, you convert it to utf-8 as usual:
rawtext = r"""<a href="http://www.mypage.com/\u0441andmoretext">\u00b2<\a>"""
text = rawtext.decode("unicode_escape")
# Text operations go here
...
output_text = text.encode("utf-8")
If there are othe bytes outside the 32-127 range, the unicode_escape codec assumes them to be in the latin1 encoding. So if your response mixes utf-8 and these \uXXXX sequences you have to:
- decode the original string using utf-8
- encode back to latin1
- decode using "unicode_escape"
- work on the text
- encode back to utf-8
来源:https://stackoverflow.com/questions/29805425/python-2-7-how-to-convert-unicode-escapes-in-a-string-into-actual-utf-8-charact