Although there are similar questions, I can't seem to find a working solution for my case:
I'm encountering some annoying hex chars in strings, e.g.
These are not "hex characters" but the escaped representation (UTF-8 encoded bytes in the first case, a Unicode code point in the second case) of the Unicode characters 'LEFT DOUBLE QUOTATION MARK' ('“') and 'RIGHT DOUBLE QUOTATION MARK' ('”').
>>> s = "\xe2\x80\x9chttp://www.google.com\xe2\x80\x9d blah blah#%#@$^blah"
>>> print s
“http://www.google.com” blah blah#%#@$^blah
>>> s.decode("utf-8")
u'\u201chttp://www.google.com\u201d blah blah#%#@$^blah'
>>> print s.decode("utf-8")
“http://www.google.com” blah blah#%#@$^blah
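The session above is Python 2. As a rough Python 3 sketch of the same idea (assuming the data arrives as UTF-8 bytes; the variable names are only illustrative), decoding the bytes and asking `unicodedata` for the character names shows that these are ordinary Unicode characters:

```python
import unicodedata

# The same bytes as in the session above, written as a Python 3 bytes literal.
raw = b"\xe2\x80\x9chttp://www.google.com\xe2\x80\x9d blah blah#%#@$^blah"

# Decoding turns each 3-byte UTF-8 sequence into a single Unicode character.
text = raw.decode("utf-8")

# The first space-separated token is the URL wrapped in the two curly quotes.
first_word = text.split(" ")[0]
opening = first_word[0]
closing = first_word[-1]

print(unicodedata.name(opening))  # LEFT DOUBLE QUOTATION MARK
print(unicodedata.name(closing))  # RIGHT DOUBLE QUOTATION MARK
```

In Python 3 the distinction is explicit: `bytes` holds the encoded form, `str` holds the code points.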
As for how to remove them: they are just ordinary characters, so a simple str.replace() will do:
>>> s.replace("\xe2\x80\x9c", "").replace("\xe2\x80\x9d", "")
'http://www.google.com blah blah#%#@$^blah'
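For what it's worth, here is a minimal Python 3 sketch of the same removal, assuming the string has already been decoded to `str`. It uses `str.translate` instead of chained `replace()` calls; `strip_curly_quotes` and `CURLY_QUOTES` are hypothetical names, not part of any library:

```python
# Code points mapped to None are deleted by str.translate.
CURLY_QUOTES = {0x201C: None, 0x201D: None}

def strip_curly_quotes(text):
    """Return text with U+201C and U+201D removed (hypothetical helper)."""
    return text.translate(CURLY_QUOTES)

raw = b"\xe2\x80\x9chttp://www.google.com\xe2\x80\x9d blah blah#%#@$^blah"
cleaned = strip_curly_quotes(raw.decode("utf-8"))
print(cleaned)  # http://www.google.com blah blah#%#@$^blah
```

The table approach scales a little better if more characters need to go later; for just two characters, chained `replace()` calls work equally well.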
If you want to get rid of all non-ASCII characters at once, decode to Unicode, then encode to ASCII with the "ignore" error handler:
>>> s.decode("utf-8").encode("ascii", "ignore")
'http://www.google.com blah blah#%#@$^blah'
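A Python 3 sketch of the same trick, assuming the input is already a `str` (the helper name `to_ascii` is made up for illustration). Since `encode()` returns `bytes` in Python 3, it needs a final decode to get a string back:

```python
def to_ascii(text):
    """Drop every non-ASCII character from text (hypothetical helper)."""
    # "ignore" silently skips characters that ASCII cannot represent.
    return text.encode("ascii", "ignore").decode("ascii")

sample = "\u201chttp://www.google.com\u201d blah blah#%#@$^blah"
print(to_ascii(sample))       # http://www.google.com blah blah#%#@$^blah
print(to_ascii("caf\u00e9"))  # caf
```

Note that this drops every non-ASCII character, accented letters included, so it is only appropriate when losing them is acceptable.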