(Note: This answer pertains to Python 2.7.11+.)
The answer at https://stackoverflow.com/a/1701378/257924 refers to the Unidecode package, which is what I was looking for. While using that package, I also discovered the ultimate source of my confusion, which is explained in depth at https://pythonhosted.org/kitchen/unicode-frustrations.html#frustration-3-inconsistent-treatment-of-output, specifically in this section:
Frustration #3: Inconsistent treatment of output
Alright, since the python community is moving to using unicode strings everywhere, we might as well convert everything to unicode strings and use that by default, right? Sounds good most of the time but
there’s at least one huge caveat to be aware of. Anytime you output text to the terminal or to a file, the text has to be converted into a byte str. Python will try to implicitly convert from unicode to
byte str... but it will throw an exception if the bytes are non-ASCII:
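The failure described in the quote can be reproduced without any terminal or file involved. The following is only a minimal sketch (the sample string is my own, and the snippet is written so that it runs on both Python 2.7 and Python 3): encoding a non-ASCII unicode string with the ASCII codec raises UnicodeEncodeError, while an explicit UTF-8 encode succeeds.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Hypothetical sample: "cafe" with an acute accent on the final letter.
text = u"caf\u00e9"

try:
    # Roughly what Python 2 does implicitly when it needs a byte str.
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    print("ascii codec failed:", exc.reason)

# Choosing the encoding explicitly avoids the exception.
utf8_bytes = text.encode("utf-8")
print("utf-8 byte length:", len(utf8_bytes))
```

The accented letter occupies two bytes in UTF-8, which is why the byte string is one element longer than the unicode string.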
The following is my demonstration script that uses Unidecode. The characters listed in the names variable are the ones I need translated into something readable, rather than removed, for the types of web pages I am analyzing.
#!/usr/bin/env python
# -*- mode: python; coding: utf-8 -*-
# The above coding is needed to avoid this error: SyntaxError: Non-ASCII character '\xe2' in file ./unicodedata_normalize_test.py on line 9, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
import os
import re
import unicodedata
from unidecode import unidecode
names = [
'HYPHEN-MINUS',
'EM DASH',
'EN DASH',
'MINUS SIGN',
'APOSTROPHE',
'LEFT SINGLE QUOTATION MARK',
'RIGHT SINGLE QUOTATION MARK',
'LATIN SMALL LETTER A WITH ACUTE',
]
for name in names:
    character = unicodedata.lookup(name)
    unidecoded = unidecode(character)
    print
    print 'name      ',name
    print 'character ',character
    print 'unidecoded',unidecoded
Sample output of the above script is:
censored@censored:~$ unidecode_test
name HYPHEN-MINUS
character -
unidecoded -
name EM DASH
character —
unidecoded --
name EN DASH
character –
unidecoded -
name MINUS SIGN
character −
unidecoded -
name APOSTROPHE
character '
unidecoded '
name LEFT SINGLE QUOTATION MARK
character ‘
unidecoded '
name RIGHT SINGLE QUOTATION MARK
character ’
unidecoded '
name LATIN SMALL LETTER A WITH ACUTE
character á
unidecoded a
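For contrast, here is a sketch of a different, standard-library-only technique that is often suggested for this problem: NFKD normalization followed by an ASCII encode with errors ignored. It is not what Unidecode does. The helper name strip_to_ascii and the sample string are my own. The accented letter survives as a plain "a", but the em dash and the curly quotes have no decomposition and are simply dropped, which is the "removed" behaviour I wanted to avoid.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import unicodedata


def strip_to_ascii(text):
    """NFKD-normalize, then discard whatever is still not ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


# Hypothetical sample: a, EM DASH, b, space, LEFT SINGLE QUOTATION MARK, c,
# RIGHT SINGLE QUOTATION MARK, space, LATIN SMALL LETTER A WITH ACUTE.
sample = u'a\u2014b \u2018c\u2019 \u00e1'

result = strip_to_ascii(sample)
print(result)  # prints: ab c a
```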
The following more elaborate script loads several web pages that contain many Unicode characters. See the comments in the script:
#!/usr/bin/env python
# -*- mode: python; coding: utf-8 -*-
import os
import re
import subprocess
import requests
from unidecode import unidecode
urls = [
'https://system76.com/laptops/kudu',
'https://stackoverflow.com/a/38249916/257924',
'https://www.peterbe.com/plog/unicode-to-ascii',
'https://stackoverflow.com/questions/227459/ascii-value-of-a-character-in-python?rq=1#comment35813354_227472',
    # Uncomment the following to show that this script works without throwing exceptions, but at the expense of a huge amount of diff output:
###'https://en.wikipedia.org/wiki/List_of_Unicode_characters',
]
# The following variable settings represent what just works without throwing exceptions.
# Setting re_encode to False and not_encode to True results in the write function throwing an exception of
#
# Traceback (most recent call last):
#   File "./simple_wget.py", line 52, in <module>
#     file_fp.write(data[ext])
# UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 33511: ordinal not in range(128)
#
# This is the crux of my confusion and is explained by https://pythonhosted.org/kitchen/unicode-frustrations.html#frustration-3-inconsistent-treatment-of-output
# So this is why we set re_encode to True and not_encode to False below:
force_utf_8 = False
re_encode = True
not_encode = False
do_unidecode = True
for url in urls:
    #
    # Load the text from request as a true unicode string:
    #
    r = requests.get(url)
    print "\n\n\n"
    print "url:",url
    print "current encoding:",r.encoding
    data = {}
    if force_utf_8:
        # The next two lines do not work. They cause the write to fail:
        r.encoding = "UTF-8"
        data['old'] = r.text # ok, data is a true unicode string
    if re_encode:
        data['old'] = r.text.encode(r.encoding)
    if not_encode:
        data['old'] = r.text
    if do_unidecode:
        # translate offending characters in unicode:
        data['new'] = unidecode(r.text)
    html_base = re.sub(r'[^a-zA-Z0-9_-]+', '__', url)
    diff_cmd = "diff "
    for ext in [ 'old', 'new' ]:
        if ext in data:
            print "ext:",ext
            html_file = "{}.{}.html".format(html_base, ext)
            with open(html_file, 'w') as file_fp:
                file_fp.write(data[ext])
            print "Wrote",html_file
            diff_cmd = diff_cmd + " " + html_file
    if 'old' in data and 'new' in data:
        print 'Executing:',diff_cmd
        subprocess.call(diff_cmd, shell=True)
The gist shows the output of the above script: the Linux diff command run on the "old" and "new" HTML files, so that the translations are visible. Text in languages such as German will be mistransliterated, but that is fine for my purposes, which only require a lossy translation of the various single-quote, double-quote, and dash-like characters.
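The UnicodeEncodeError quoted in the script comments comes from writing a unicode string to a file opened in byte mode. As an alternative to re-encoding by hand, the file can be opened with an explicit encoding. This is a minimal sketch that assumes io.open from the standard library (available in Python 2.6+ and Python 3). The sample text and the temporary file name are hypothetical.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import io
import os
import tempfile

# Hypothetical page text containing e-acute and an EM DASH.
text = u'r\u00e9sum\u00e9 \u2014 done'

# Hypothetical output location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'example.html')

# io.open encodes on write, so a unicode string can be passed directly.
with io.open(path, 'w', encoding='utf-8') as file_fp:
    file_fp.write(text)

# Read the raw bytes back to see what actually landed on disk.
with io.open(path, 'rb') as file_fp:
    raw = file_fp.read()

# Read it back as text to confirm nothing was lost.
with io.open(path, 'r', encoding='utf-8') as file_fp:
    round_trip = file_fp.read()

print("bytes written:", len(raw))
print("round trip ok:", round_trip == text)
```

Unlike the Unidecode step, this keeps the original characters intact, so it suits the "old" file rather than the lossy "new" one.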