I\'m having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup.
The problem is that
This problem often happens when a django project deploys using Apache. Because Apache sets environment variable LANG=C in /etc/sysconfig/httpd. Just open the file and comment (or change to your flavior) this setting. Or use the lang option of the WSGIDaemonProcess command, in this case you will be able to set different LANG environment variable to different virtualhosts.
well i tried everything but it did not help, after googling around i figured the following and it helped. python 2.7 is in use.
# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')
The problem is that you're trying to print a unicode character, but your terminal doesn't support it.
You can try installing language-pack-en
package to fix that:
sudo apt-get install language-pack-en
which provides English translation data updates for all supported packages (including Python). Install different language package if necessary (depending which characters you're trying to print).
On some Linux distributions it's required in order to make sure that the default English locales are set-up properly (so unicode characters can be handled by shell/terminal). Sometimes it's easier to install it, than configuring it manually.
Then when writing the code, make sure you use the right encoding in your code.
For example:
open(foo, encoding='utf-8')
If you've still a problem, double check your system configuration, such as:
Your locale file (/etc/default/locale
), which should have e.g.
LANG="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
or:
LC_ALL=C.UTF-8
LANG=C.UTF-8
Value of LANG
/LC_CTYPE in shell.
Check which locale your shell supports by:
locale -a | grep "UTF-8"
Demonstrating the problem and solution in fresh VM.
Initialize and provision the VM (e.g. using vagrant):
vagrant init ubuntu/trusty64; vagrant up; vagrant ssh
See: available Ubuntu boxes..
Printing unicode characters (such as trade mark sign like ™
):
$ python -c 'print(u"\u2122");'
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 0: ordinal not in range(128)
Now installing language-pack-en
:
$ sudo apt-get -y install language-pack-en
The following extra packages will be installed:
language-pack-en-base
Generating locales...
en_GB.UTF-8... /usr/sbin/locale-gen: done
Generation complete.
Now problem should be solved:
$ python -c 'print(u"\u2122");'
™
Otherwise, try the following command:
$ LC_ALL=C.UTF-8 python -c 'print(u"\u2122");'
™
Add line below at the beginning of your script ( or as second line):
# -*- coding: utf-8 -*-
That's definition of python source code encoding. More info in PEP 263.
Alas this works in Python 3 at least...
Python 3
Sometimes the error is in the enviroment variables and enconding so
import os
import locale
os.environ["PYTHONIOENCODING"] = "utf-8"
myLocale=locale.setlocale(category=locale.LC_ALL, locale="en_GB.UTF-8")
...
print(myText.encode('utf-8', errors='ignore'))
where errors are ignored in encoding.