Can't parse simple JSON with Python

爱一瞬间的悲伤 2020-12-10 09:21

I have a very simple JSON document that I can't parse with the simplejson module. Reproduction:

import simplejson as json
json.loads(r'{"translatedatt1":"Vari\351es"}')


        
2 Answers
  • 2020-12-10 09:48

    You probably meant to use a unicode string rather than a raw string:

    >>> import simplejson as json
    >>> json.loads(u'{"translatedatt1":"Vari\351es"}')
    {u'translatedatt1': u'Vari\xe9es'}
    

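    If you want to escape a character inside the JSON string itself, JSON requires the \uNNNN form with exactly four hex digits. For é that would be \u00e9; the example below parses \u351e as the single code point U+351E: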

    >>> json.loads(r'{"translatedatt1":"Vari\u351es"}')
    {'translatedatt1': u'Vari\u351es'}
    

    Note that the resulting dict is slightly different in this case: when parsing a unicode string, simplejson uses unicode strings for the keys; otherwise it uses byte-string keys.

    If your JSON data does in fact use \351e, then it is simply broken and not valid JSON.
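
    A minimal sketch of that difference, assuming Python 3 and the standard-library json module in place of simplejson. The names try_parse, broken and valid are made up for illustration:

```python
import json

# Backslash followed by digits: not one of the escapes JSON defines.
broken = r'{"translatedatt1":"Vari\351es"}'
# \u followed by exactly four hex digits: a valid JSON escape for é.
valid = r'{"translatedatt1":"Vari\u00e9es"}'


def try_parse(text):
    """Return the parsed object, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


print(try_parse(broken))
print(try_parse(valid))
```

    The first call prints None because the \351 escape is rejected; the second prints {'translatedatt1': 'Variées'}.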

  • 2020-12-10 10:02

    The parser is right to reject this: Vari\351es contains an invalid escape. The JSON standard does not allow a \ followed by just digits.

    Whatever produced that output should be fixed. If that is impossible, you'll need to use a regular expression to either remove those escapes or replace them with valid ones.

    If we interpret 351 as an octal number, it points to the Unicode code point U+00E9, the é character (LATIN SMALL LETTER E WITH ACUTE). You can 'repair' your JSON input with:

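    import re
    
    # A backslash followed by octal digits, e.g. \351
    invalid_escape = re.compile(r'\\[0-7]{1,6}')  # up to 6 digits for codepoints up to FFFF
    
    def replace_with_codepoint(match):
        # Drop the leading backslash and read the digits as octal.
        # unichr() is Python 2; on Python 3 use chr() instead.
        return unichr(int(match.group(0)[1:], 8))
    
    
    def repair(brokenjson):
        # Swap every invalid octal escape for the character it encodes.
        return invalid_escape.sub(replace_with_codepoint, brokenjson)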
    

    Using repair() your example can be loaded:

    >>> json.loads(repair(r'{"translatedatt1":"Vari\351es"}'))
    {u'translatedatt1': u'Vari\xe9es'}
    

    You may need to adjust how the codepoints are interpreted; I chose octal because it yields Variées, which is an actual word, but you should test this against other escapes in your data.
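
    A variant of the same repair for Python 3, offered as a sketch rather than a drop-in fix. It rewrites each octal escape as a valid \uNNNN escape instead of inserting the raw character, and it uses the standard-library json module. The names repair_octal_escapes and _to_json_escape are hypothetical, and the limit of three octal digits (as in C-style escapes) is an assumption that differs from the six digits used above:

```python
import json
import re

# Matches any backslash plus the character after it. Group 1 is set only
# when the backslash is followed by one to three octal digits
# (assumption: C-style octal escapes, so at most three digits).
_escape = re.compile(r'\\(?:([0-7]{1,3})|.)', re.DOTALL)


def _to_json_escape(match):
    digits = match.group(1)
    if digits is None:
        # Some other escape such as \n or \\: leave it untouched.
        return match.group(0)
    # Read the digits as octal and emit a four-hex-digit JSON escape.
    return '\\u%04x' % int(digits, 8)


def repair_octal_escapes(broken_json):
    """Replace backslash-octal escapes with valid JSON \\uNNNN escapes."""
    return _escape.sub(_to_json_escape, broken_json)


repaired = repair_octal_escapes(r'{"translatedatt1":"Vari\351es"}')
print(repaired)
print(json.loads(repaired))
```

    This prints {"translatedatt1":"Vari\u00e9es"} followed by {'translatedatt1': 'Variées'}. Emitting an escape keeps the text valid JSON even if a decoded character happens to be a quote or a control character, and consuming backslashes in pairs means an escaped backslash followed by digits (\\351) is left alone.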
