How to get string objects instead of Unicode from JSON?

伪装坚强ぢ 2020-11-22 14:43

I'm using Python 2 to parse JSON from ASCII encoded text files.

When loading these files with either json or simplejson, all my

21 answers
  • 2020-11-22 15:34

    I've adapted the code from Mark Amery's answer, mainly to rely less on isinstance in favor of duck typing.

    The encoding is done manually and ensure_ascii is disabled. The Python docs for json.dump say:

    If ensure_ascii is True (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences

    Disclaimer: the doctest uses Hungarian text. Some notable Hungarian-related character encodings are: cp852, the IBM/OEM encoding used e.g. in DOS (sometimes referred to as ASCII, incorrectly I think, since it depends on the code page setting); cp1250, used e.g. in Windows (sometimes referred to as ANSI, depending on the locale settings); and iso-8859-2, sometimes used on HTTP servers. The test text Tüskéshátú kígyóbűvölő is attributed to Koltai László (native personal name form) and is from Wikipedia.

    # coding: utf-8
    """
    This file should be encoded correctly with utf-8.
    """
    import json
    
    def encode_items(input, encoding='utf-8'):
        u"""original from: https://stackoverflow.com/a/13101776/611007
        adapted by SO/u/611007 (20150623)
        >>> 
        >>> ## run this with `python -m doctest <this file>.py` from command line
        >>> 
        >>> txt = u"Tüskéshátú kígyóbűvölő"
        >>> txt2 = u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"
        >>> txt3 = u"uúuutifu"
        >>> txt4 = b'u\\xfauutifu'
        >>> # txt4 shouldn't be 'u\\xc3\\xbauutifu', string content needs double backslash for doctest:
        >>> assert u'\\u0102' not in b'u\\xfauutifu'.decode('cp1250')
        >>> txt4u = txt4.decode('cp1250')
        >>> assert txt4u == u'u\\xfauutifu', repr(txt4u)
        >>> txt5 = b"u\\xc3\\xbauutifu"
        >>> txt5u = txt5.decode('utf-8')
        >>> txt6 = u"u\\u251c\\u2551uutifu"
        >>> there_and_back_again = lambda t: encode_items(t, encoding='utf-8').decode('utf-8')
        >>> assert txt == there_and_back_again(txt)
        >>> assert txt == there_and_back_again(txt2)
        >>> assert txt3 == there_and_back_again(txt3)
        >>> assert txt3.encode('cp852') == there_and_back_again(txt4u).encode('cp852')
        >>> assert txt3 == txt4u,(txt3,txt4u)
        >>> assert txt3 == there_and_back_again(txt5)
        >>> assert txt3 == there_and_back_again(txt5u)
        >>> assert txt3 == there_and_back_again(txt4u)
        >>> assert txt3.encode('cp1250') == encode_items(txt4, encoding='utf-8')
        >>> assert txt3.encode('utf-8') == encode_items(txt5, encoding='utf-8')
        >>> assert txt2.encode('utf-8') == encode_items(txt, encoding='utf-8')
        >>> assert {'a':txt2.encode('utf-8')} == encode_items({'a':txt}, encoding='utf-8')
        >>> assert [txt2.encode('utf-8')] == encode_items([txt], encoding='utf-8')
        >>> assert [[txt2.encode('utf-8')]] == encode_items([[txt]], encoding='utf-8')
        >>> assert [{'a':txt2.encode('utf-8')}] == encode_items([{'a':txt}], encoding='utf-8')
        >>> assert {'b':{'a':txt2.encode('utf-8')}} == encode_items({'b':{'a':txt}}, encoding='utf-8')
        """
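        try:
            # Duck typing: anything with iteritems() is treated as a mapping.
            input.iteritems
            # Pass `encoding` down so nested values use the requested codec.
            return {encode_items(k, encoding): encode_items(v, encoding)
                    for (k, v) in input.iteritems()}
        except AttributeError:
            if isinstance(input, unicode):
                return input.encode(encoding)
            elif isinstance(input, str):
                # Byte strings are assumed to be encoded already.
                return input
            try:
                iter(input)
                return [encode_items(e, encoding) for e in input]
            except TypeError:
                # Non-iterable scalars (numbers, booleans, None) pass through.
                return input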
    
    def alt_dumps(obj, **kwargs):
        """
        >>> alt_dumps({'a': u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"})
        '{"a": "T\\xc3\\xbcsk\\xc3\\xa9sh\\xc3\\xa1t\\xc3\\xba k\\xc3\\xadgy\\xc3\\xb3b\\xc5\\xb1v\\xc3\\xb6l\\xc5\\x91"}'
        """
        # ensure_ascii is always forced to False; drop any caller-supplied value.
        kwargs.pop('ensure_ascii', None)
        return json.dumps(encode_items(obj), ensure_ascii=False, **kwargs)
    

    I'd also like to highlight the answer of Jarret Hardie which references the JSON spec, quoting:

    A string is a collection of zero or more Unicode characters

    In my use case I had UTF-8 encoded files containing JSON. ensure_ascii produces properly escaped but not very readable JSON files, which is why I adapted Mark Amery's answer to fit my needs.

    The doctest is not particularly well thought out, but I share the code in the hope that it will be useful to someone.
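
To make the effect of ensure_ascii concrete, here is a minimal sketch that uses only the standard json module. The sample string is the first word of the test text above, and the variable names (text, escaped, readable) are illustrative assumptions, not part of the answer's code.

```python
import json

# "Tüskéshátú", written with escapes so this source file stays pure ASCII.
text = u"T\u00fcsk\u00e9sh\u00e1t\u00fa"

# Default (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps({u"a": text})

# ensure_ascii=False: the characters are emitted as-is, which is easier to read.
readable = json.dumps({u"a": text}, ensure_ascii=False)

# Both forms decode back to the same Unicode text.
assert json.loads(escaped) == json.loads(readable) == {u"a": text}
```

Either output is valid JSON; the choice only affects how readable the file is and which encoding you need when writing it to disk.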

  • 2020-11-22 15:34

    Check out this answer to a similar question, which states:

    The u- prefix just means that you have a Unicode string. When you really use the string, it won't appear in your data. Don't be thrown by the printed output.

    For example, try this:

    print mail_accounts[0]["i"]
    

    You won't see a u.
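
A small sketch of the same point: the u prefix belongs to the repr of the value (what Python 2 shows at the interactive prompt or when printing a whole container), not to the string itself. The sample data below is made up for illustration; mail_accounts in the answer refers to the asker's own data.

```python
import json

# Hypothetical sample data standing in for the asker's structure.
accounts = json.loads('[{"i": "blah"}]')
value = accounts[0]["i"]

# repr() is what you see when a container is printed; on Python 2 it
# starts with u, on Python 3 it does not.
print(repr(value))

# Printing the value itself shows only its characters.
print(value)

# The content is four characters long and compares equal to a plain literal.
assert len(value) == 4
assert value == "blah"
```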

  • 2020-11-22 15:35

    This is late to the game, but I built this recursive caster. It works for my needs and I think it's relatively complete. It may help you.

    def _parseJSON(self, obj):
        # Recursively cast unicode keys and values (Python 2) to str.
        if isinstance(obj, dict):
            return {str(key): self._parseJSON(value)
                    for key, value in obj.iteritems()}
        if isinstance(obj, list):
            # List items may be dicts, lists, or scalars, so recurse on each.
            return [self._parseJSON(item) for item in obj]
        if isinstance(obj, unicode):
            # Note: str() raises UnicodeEncodeError on non-ASCII text.
            val = str(obj)
            # Numeric-looking strings are cast to int or float.
            if val.isdigit():
                return int(val)
            try:
                return float(val)
            except ValueError:
                return val
        # Numbers, booleans, and None pass through unchanged.
        return obj
    

    Just pass it a JSON object like so:

    obj = json.loads(content, parse_float=float, parse_int=int)
    obj = _parseJSON(obj)
    

    I have it as a private member of a class, but you can repurpose the method as you see fit.
