I\'m having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup.
The problem is that
The problem is that you're trying to print a unicode character, but your terminal doesn't support it.
You can try installing language-pack-en
package to fix that:
sudo apt-get install language-pack-en
which provides English translation data updates for all supported packages (including Python). Install different language package if necessary (depending which characters you're trying to print).
On some Linux distributions it's required in order to make sure that the default English locales are set-up properly (so unicode characters can be handled by shell/terminal). Sometimes it's easier to install it, than configuring it manually.
Then when writing the code, make sure you use the right encoding in your code.
For example:
open(foo, encoding='utf-8')
If you've still a problem, double check your system configuration, such as:
Your locale file (/etc/default/locale
), which should have e.g.
LANG="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
or:
LC_ALL=C.UTF-8
LANG=C.UTF-8
Value of LANG
/LC_CTYPE in shell.
Check which locale your shell supports by:
locale -a | grep "UTF-8"
Demonstrating the problem and solution in fresh VM.
Initialize and provision the VM (e.g. using vagrant):
vagrant init ubuntu/trusty64; vagrant up; vagrant ssh
See: available Ubuntu boxes..
Printing unicode characters (such as trade mark sign like ™
):
$ python -c 'print(u"\u2122");'
Traceback (most recent call last):
File "", line 1, in
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 0: ordinal not in range(128)
Now installing language-pack-en
:
$ sudo apt-get -y install language-pack-en
The following extra packages will be installed:
language-pack-en-base
Generating locales...
en_GB.UTF-8... /usr/sbin/locale-gen: done
Generation complete.
Now problem should be solved:
$ python -c 'print(u"\u2122");'
™
Otherwise, try the following command:
$ LC_ALL=C.UTF-8 python -c 'print(u"\u2122");'
™