Python 2.7: How to convert unicode escapes in a string into actual utf-8 characters

笑着哭i 提交于 2019-11-29 05:13:43

Try

>>> s = "<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"
>>> s.decode("raw_unicode_escape")
u'<a href = "http://www.mypage.com/\u0441andmoretext">\xb2<\\a>'

And then you can encode to utf8 as usual.

Python does contain some special string codecs for cases like this.

In this case, if there are no other characters outside the 32-127 range, you can safely decode your byte-string using the "unicode_escape" codec to have a proper Unicode text object in Python. (On which your program should be performing all textual operations) - Whenever you are outputting that text again, you convert it to utf-8 as usual:

rawtext = r"""<a href="http://www.mypage.com/\u0441andmoretext">\u00b2<\a>"""
text = rawtext.decode("unicode_escape")
# Text operations go here
...
output_text = text.encode("utf-8")

If there are othe bytes outside the 32-127 range, the unicode_escape codec assumes them to be in the latin1 encoding. So if your response mixes utf-8 and these \uXXXX sequences you have to:

  1. decode the original string using utf-8
  2. encode back to latin1
  3. decode using "unicode_escape"
  4. work on the text
  5. encode back to utf-8
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!