Python 2.7: How to convert unicode escapes in a string into actual utf-8 characters

Submitted by 笑着哭i on 2019-11-30 07:39:23

Question


I use Python 2.7 and I'm receiving a string from a server (a byte string, not a unicode object). Inside that string I find text with unicode escape sequences, for example:

<a href = "http://www.mypage.com/\u0441andmoretext">\u00b2<\a>

How do I convert those \uxxxx sequences back to utf-8? The answers I found either dealt with &# entities or required eval(), which is too slow for my purposes. I need a universal solution for any text containing such sequences.

Edit: <\a> is a typo, but I want the solution to tolerate such typos as well. Only \u sequences should be converted.

The example text is meant in proper python syntax like this:

"<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"

The desired output is in proper python syntax

"<a href = \"http://www.mypage.com/\xd1\x81andmoretext\">\xc2\xb2<\\a>"

Answer 1:


Try

>>> s = "<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"
>>> s.decode("raw_unicode_escape")
u'<a href = "http://www.mypage.com/\u0441andmoretext">\xb2<\\a>'

And then you can encode to utf8 as usual.
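As a minimal sketch of the full round trip, the decode and the final encode can be chained as below. The variable names are illustrative, and `codecs.decode` is used in place of the `str.decode` method only so that the same lines also run on Python 3; on Python 2.7 the two are equivalent here.

```python
import codecs

# Byte string as it might arrive from the server: the escapes are literal
# backslash-u text, not real characters.
s = b'<a href = "http://www.mypage.com/\\u0441andmoretext">\\u00b2<\\a>'

# raw_unicode_escape interprets only \uXXXX and \UXXXXXXXX sequences;
# other backslashes (such as the stray \a) are left untouched.
text = codecs.decode(s, "raw_unicode_escape")

# Encode the resulting unicode text to UTF-8 bytes.
utf8_bytes = text.encode("utf-8")
```

With this input, `utf8_bytes` matches the desired output given in the question, including the unchanged `<\a>`.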




Answer 2:


Python does contain some special string codecs for cases like this.

In this case, if the string contains no bytes outside the 32-127 range, you can safely decode your byte string using the "unicode_escape" codec to get a proper Unicode text object in Python, which is what your program should use for all textual operations. Whenever you output that text again, you encode it to utf-8 as usual:

rawtext = r"""<a href="http://www.mypage.com/\u0441andmoretext">\u00b2<\a>"""
text = rawtext.decode("unicode_escape")
# ... text operations go here ...
output_text = text.encode("utf-8")
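One difference from "raw_unicode_escape" is worth checking against the question's requirement that only \u sequences are converted: "unicode_escape" also interprets the other Python string escapes, so the stray `\a` becomes a BEL control character. A small comparison, with illustrative variable names:

```python
import codecs

# Literal backslash-a and a literal \u00b2 escape, as plain ASCII bytes.
raw = b'<\\a> \\u00b2'

# unicode_escape handles every Python escape, so \a turns into U+0007.
via_unicode_escape = codecs.decode(raw, "unicode_escape")

# raw_unicode_escape handles only \u and \U, so \a stays as two characters.
via_raw_unicode_escape = codecs.decode(raw, "raw_unicode_escape")
```

If the input may contain backslashes that are not part of a \u sequence, "raw_unicode_escape" is probably the closer fit.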

If there are other bytes outside the 32-127 range, the unicode_escape codec assumes them to be in the latin1 encoding. So if your response mixes utf-8 and these \uXXXX sequences, you have to:

  1. decode the original string using utf-8
  2. encode back to latin1
  3. decode using "unicode_escape"
  4. work on the text
  5. encode back to utf-8
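The steps above can be sketched as follows. The sample input is made up for illustration. Note one assumption: step 2 only succeeds when every character already present as real UTF-8 fits in latin1 (U+0000 to U+00FF); otherwise the encode raises UnicodeEncodeError.

```python
# Hypothetical input: UTF-8 bytes for "é" mixed with a literal \u0441 escape.
raw = b'caf\xc3\xa9 \\u0441'

# 1. decode the original string using utf-8
text = raw.decode("utf-8")

# 2. encode back to latin1 (fails for characters above U+00FF)
latin1_bytes = text.encode("latin-1")

# 3. decode using "unicode_escape"; bytes above 127 are read as latin1
text = latin1_bytes.decode("unicode_escape")

# 4. work on the text here

# 5. encode back to utf-8
output = text.encode("utf-8")
```

Here `output` contains the original "é" unchanged and the escape replaced by the UTF-8 bytes for U+0441.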


Source: https://stackoverflow.com/questions/29805425/python-2-7-how-to-convert-unicode-escapes-in-a-string-into-actual-utf-8-charact
