Decoding issue while parsing JSON [python]

孤城傲影 2020-12-21 20:58

I am reading a JSON file in Python that has many fields and values (about 8000 records). Environment: Windows 10, Python 3.6.4. Code:

import json
json_data = json.loa         


        
1 answer
  • 2020-12-21 21:10

    The snippet you are asking about seems to have been double-encoded. Basically, whatever originally generated this data produced text in Latin-1 or some related encoding (Windows code page 1252?). It was then fed to a process which converts Latin-1 to UTF-8 ... twice.

    Of course, "converting" data that is already UTF-8 while telling the computer it is Latin-1 just produces mojibake.

    The string INGL\xc3\x83\xc2\x89S supports this analysis, if you can guess that it is supposed to say INGLÉS (Inglés in upper case). The UTF-8 encoding of É is the two bytes \xC3 \x89. Read as Latin-1, those two bytes are the characters Ã (U+00C3) and the control character U+0089, and encoding those two characters as UTF-8 once more gives \xc3\x83 \xc2\x89. (The code points are the same in Unicode, whose first 256 code points match Latin-1, though the two are not compatible at the encoding level.)
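As a minimal sketch of that reasoning, the following reproduces the byte sequence from the assumed original text. The variable names are illustrative only, and the assumption that the intended word is INGLÉS is the guess described above, not something the data confirms.

```python
# Assumed intended text: "INGLÉS" (É written as an escape to keep the source ASCII-only).
original = "INGL\u00c9S"

# First, correct encoding step: É becomes the two bytes C3 89.
utf8_once = original.encode("utf-8")

# The mistake: the UTF-8 bytes are interpreted as if they were Latin-1 text.
misread = utf8_once.decode("latin-1")

# Second, superfluous encoding step: each misread character is encoded again.
utf8_twice = misread.encode("utf-8")

print(utf8_once)   # b'INGL\xc3\x89S'
print(utf8_twice)  # b'INGL\xc3\x83\xc2\x89S'
```

The final bytes match the problematic string exactly, which is what makes the double-encoding explanation plausible.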

    Notice that being able to guess which string a problematic sequence is supposed to represent is the crucial step here; it also explains why including a representative snippet of the problematic data - with enough context! - is vital for debugging.

    Anyway, if the entire file has the same symptom, you should be able to undo the second, superfluous and incorrect round of re-encoding; though an error this far into the file makes me imagine it's probably a local problem with just one or a few records. Maybe they were merged from multiple input files, only one of which had this error. Then fixing it requires a fair bit of detective work, and manual editing, or identifying and fixing the erroneous source. A quick and dirty workaround is to simply manually remove any erroneous records.
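A hedged sketch of undoing the superfluous round, one string at a time: `undo_double_utf8` is a hypothetical helper, not part of any library. It assumes the wrong step used Latin-1 exactly; if it was really Windows code page 1252, the codec name would need to change. It leaves a string untouched whenever the reversal does not apply cleanly.

```python
def undo_double_utf8(text):
    """Reverse one superfluous Latin-1 -> UTF-8 round (hypothetical helper).

    Returns the text unchanged if it cannot be the product of such a round.
    """
    try:
        # Map each character back to the single byte it was misread from,
        # then decode those bytes as the UTF-8 they originally were.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or not valid UTF-8 afterwards:
        # treat the string as already correct.
        return text


# The damaged bytes from the file, decoded once as UTF-8 the way a JSON parser would.
damaged = b"INGL\xc3\x83\xc2\x89S".decode("utf-8")
repaired = undo_double_utf8(damaged)

print(repaired == "INGL\u00c9S")                  # True
print(undo_double_utf8(repaired) == repaired)     # True: a correct string is left alone
```

Applying this per record rather than to the whole file fits the case where only some records are damaged. It is still a heuristic: a correct string that happens to look like mojibake would be altered, so the results are worth spot-checking.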
