How does Unicode work in Python 2? I just don't get it.
Here I download data from a server and parse it as JSON.
Traceback (most recent cal
Changing the encoding to Latin-1 (ISO-8859-1) solved an issue I observed with html2text.py when run on the output of tex4ht. I use that for an automated word count on LaTeX documents: tex4ht converts them to HTML, and html2text.py then strips them down to plain text for counting with wc -w. If, for example, a German umlaut comes in through a literature database entry, that process fails, with html2text.py complaining, e.g.:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 32243-32245: invalid data
These errors are particularly hard to track down, and you do want to keep the umlaut in your references section. A simple change inside html2text.py from
data = data.decode(encoding)
to
data = data.decode("ISO-8859-1")
solves that issue. If you call the script with the HTML file as the first parameter, you can also pass the encoding as the second parameter and skip the modification.
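If you would rather not hard-code a single encoding, one option is to try UTF-8 first and fall back to ISO-8859-1 only when decoding fails. This is a rough sketch, not what html2text.py does; the helper name decode_with_fallback and the default list of encodings are assumptions made for illustration.

```python
def decode_with_fallback(data, encodings=("utf-8", "iso-8859-1")):
    """Decode a byte string, trying each encoding in order.

    Hypothetical helper: returns a (text, encoding_used) tuple.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; try the next one.
            continue
    raise ValueError("none of the given encodings could decode the data")


# Latin-1 bytes for "Mueller" spelled with a u-umlaut (0xFC is never valid in UTF-8).
latin1_bytes = b"M\xfcller"
text, used = decode_with_fallback(latin1_bytes)
```

Note that ISO-8859-1 maps every possible byte to a character, so the fallback never raises. If the data is actually in some third encoding, you get wrong characters silently instead of an error.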
Temporary workaround: unicode(urllib2.urlopen(url).read(), 'utf8')
- this should work if what is returned is UTF-8.
urlopen().read()
returns bytes, and you have to decode them to Unicode strings. It may also help to look at the patch from http://bugs.python.org/issue4733
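To make the decode step concrete, here is a minimal sketch that decodes a byte string and then parses it as JSON. The hard-coded raw value stands in for whatever urlopen(url).read() would return, and parse_json_bytes is a made-up helper name; the sketch also assumes the server really sends UTF-8.

```python
import json


def parse_json_bytes(raw, encoding="utf-8"):
    """Decode raw bytes to a Unicode string, then parse it as JSON.

    Hypothetical helper; `encoding` is assumed to match what the server sent.
    """
    text = raw.decode(encoding)  # bytes -> Unicode string
    return json.loads(text)      # Unicode string -> Python objects


# Stand-in for the bytes returned by urlopen(url).read().
raw = u'{"name": "M\u00fcller"}'.encode("utf-8")
result = parse_json_bytes(raw)
```

If the response declares a different charset in its Content-Type header, pass that as encoding instead of relying on the default.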