I'm having problems dealing with unicode characters in text fetched from different web pages (on different sites). I am using BeautifulSoup.
The problem is that I keep getting an error like UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 20: ordinal not in range(128).
You need to read the Python Unicode HOWTO. This error is the very first example.
Basically, stop using str to convert from unicode to encoded text / bytes. Instead, properly encode the string with .encode():
p.agent_info = u' '.join((agent_contact, agent_telno)).encode('utf-8').strip()
or work entirely in unicode.
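To make the difference concrete, here is a minimal sketch (Python 2; the values are invented, with a non-breaking space playing the troublesome character):

agent_contact = u'John'
agent_telno = u'\xa0555-1234'  # invented value; u'\xa0' is a non-breaking space

joined = u' '.join((agent_contact, agent_telno))

# str() implicitly encodes with the ascii codec and blows up on u'\xa0':
try:
    str(joined)
except UnicodeEncodeError as e:
    print(e)  # 'ascii' codec can't encode character u'\xa0' ...

# .encode() explicitly picks a codec that can represent the character:
print(joined.encode('utf-8').strip())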
I found an elegant workaround that removes the problem symbols while keeping the string a string:
yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')
It's important to note that using the ignore option is dangerous, because it silently drops any unicode (and internationalization) support from the code that uses it, as this conversion shows:
>>> u'City: Malmö'.encode('ascii', 'ignore').decode('ascii')
'City: Malm'
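If you do have to reduce to ASCII but want the loss to be visible instead of silent, the 'replace' error handler (just an illustration, not part of the original workaround) at least leaves a marker behind:

>>> u'City: Malmö'.encode('ascii', 'replace').decode('ascii')
'City: Malm?'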
Open a terminal and run the command below:
export LC_ALL="en_US.UTF-8"
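To confirm that Python actually picks the setting up, you can check the preferred encoding (a quick sanity check; works on Python 2 and 3):

>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'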
For me, what worked was:
BeautifulSoup(html_text, from_encoding="utf-8")
Hope this helps someone.
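For context, a minimal self-contained sketch of the same call (bs4 assumed; the HTML bytes are made up):

from bs4 import BeautifulSoup

# raw bytes as fetched from a page; 'Malmö' encoded as UTF-8
html_text = b'<html><body><p>City: Malm\xc3\xb6</p></body></html>'

# from_encoding tells BeautifulSoup how to decode the bytes instead of guessing
soup = BeautifulSoup(html_text, 'html.parser', from_encoding='utf-8')
print(soup.p.get_text())  # City: Malmö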
Simple helper functions (Python 2), found here:
def safe_unicode(obj, *args):
    """ return the unicode representation of obj """
    try:
        return unicode(obj, *args)
    except UnicodeDecodeError:
        # obj is byte string
        ascii_text = str(obj).encode('string_escape')
        return unicode(ascii_text)

def safe_str(obj):
    """ return the byte string representation of obj """
    try:
        return str(obj)
    except UnicodeEncodeError:
        # obj is unicode
        return unicode(obj).encode('unicode_escape')
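Usage, for illustration (Python 2; the sample values are made up):

>>> safe_unicode('Malm\xc3\xb6')   # byte string: escaped instead of crashing
u'Malm\\xc3\\xb6'
>>> safe_str(u'Malm\xf6')          # unicode: escaped to a byte string
'Malm\\xf6'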
Update for Python 3.0 and later. Run the following in a terminal:
locale-gen en_US.UTF-8
export LANG=en_US.UTF-8 LANGUAGE=en_US.en
export LC_ALL=en_US.UTF-8
This sets the system's default locale encoding to UTF-8.
More can be read at PEP 538 -- Coercing the legacy C locale to a UTF-8 based locale.