UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-6: invalid data

谎友^ 2020-11-30 00:46

How does Unicode handling work in Python 2? I just don't get it.

Here I download data from a server and parse it as JSON.

Traceback (most recent cal         


        
8 Answers
  • 2020-11-30 01:50

    Changing the encoding to Latin-1 / ISO-8859-1 solved an issue I observed with html2text.py when it is run on the output of tex4ht. I use that combination for an automated word count on LaTeX documents: tex4ht converts them to HTML, and html2text.py then strips them down to plain text for counting with wc -w. If, for example, a German umlaut comes in through a literature database entry, the process fails, with html2text.py complaining e.g.

    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 32243-32245: invalid data

    Such errors are particularly hard to track down, and you do want to keep the umlaut in your references section. A simple change inside html2text.py from

    data = data.decode(encoding)

    to

    data = data.decode("ISO-8859-1")

    solves that issue. If you call the script with the HTML file as the first parameter, you can also pass the encoding as the second parameter and skip the modification.
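
    To illustrate why the change works, here is a minimal sketch. The helper name decode_with_fallback and the sample byte strings are made up for this example and are not part of html2text.py. ISO-8859-1 assigns a character to every possible byte, so decoding with it never raises, but it will quietly produce wrong characters if the data is really in some other encoding. That is why the sketch tries UTF-8 first.

```python
def decode_with_fallback(data, encoding="utf-8", fallback="ISO-8859-1"):
    """Hypothetical helper: try `encoding` first, then fall back."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value to a code point, so this
        # decode cannot fail; the result is only correct if the data
        # really is Latin-1.
        return data.decode(fallback)


# The same name (with a u-umlaut) in two different encodings.
latin1_bytes = b"M\xfcller"      # ISO-8859-1: the umlaut is the single byte 0xFC
utf8_bytes = b"M\xc3\xbcller"    # UTF-8: the umlaut is the two bytes 0xC3 0xBC

text_a = decode_with_fallback(latin1_bytes)  # UTF-8 fails, Latin-1 is used
text_b = decode_with_fallback(utf8_bytes)    # UTF-8 succeeds directly
```

    Both calls give the same text, because each input ends up decoded with the encoding it was actually written in.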

  • 2020-11-30 01:51

    Temporary workaround: unicode(urllib2.urlopen(url).read(), 'utf8') - this should work if what is returned is UTF-8.

    urlopen().read() returns bytes, and you have to decode them to get a unicode string. It may also help to look at the patch at http://bugs.python.org/issue4733
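
    A rough sketch of the decode-then-parse order, assuming the server really sends UTF-8. To keep it runnable without a network connection, the raw variable below is a made-up stand-in for what urllib2.urlopen(url).read() would return. The payload is invented for illustration.

```python
import json

# Stand-in for the bytes returned by urlopen(url).read() (assumed UTF-8).
raw = b'{"name": "M\xc3\xbcller"}'

# Step 1: turn the bytes into a unicode string using the server's encoding.
text = raw.decode("utf-8")

# Step 2: parse the unicode string as JSON.
data = json.loads(text)
name = data["name"]
```

    If the decode step raises UnicodeDecodeError, the response is not valid UTF-8, and the encoding declared by the server (for example in the Content-Type header) is worth checking before guessing another one.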
