How to get string objects instead of Unicode from JSON?

伪装坚强ぢ 2020-11-22 14:43

I'm using Python 2 to parse JSON from ASCII encoded text files.

When loading these files with either json or simplejson, all my string values are cast to Unicode objects instead of string objects.

21 Answers
  • 2020-11-22 15:08

    So, I've run into the same problem. Guess what was the first Google result.

    Because I need to pass all data to PyGTK, unicode strings aren't very useful to me either. So I have another recursive conversion method. It's actually also needed for type-safe JSON conversion: json.dump() would bail out on any non-literals, such as Python objects. It doesn't convert dict keys, though.

    # removes any objects, turns unicode back into str
    def filter_data(obj):
        if type(obj) in (int, float, str, bool):
            return obj
        elif type(obj) == unicode:
            return str(obj)
        elif type(obj) in (list, tuple, set):
            obj = list(obj)
            for i, v in enumerate(obj):
                obj[i] = filter_data(v)
        elif type(obj) == dict:
            for i, v in obj.iteritems():
                obj[i] = filter_data(v)
        else:
            print "invalid object in data, converting to string"
            obj = str(obj)
        return obj
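
    A quick usage sketch (illustrative input; Python 2):

        import json

        data = filter_data(json.loads('["a", {"b": 1}]'))
        # data is now ['a', {u'b': 1}]: string values become str, dict keys stay unicode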
    
  • 2020-11-22 15:10

    That's because JSON makes no distinction between string objects and unicode objects: they're all strings in JavaScript.

    I think json is right to return unicode objects. In fact, I wouldn't accept anything less, since JavaScript strings are in fact unicode objects (i.e. JSON (JavaScript) strings can store any kind of unicode character), so it makes sense to create unicode objects when translating strings from JSON. Plain strings just wouldn't fit, since the library would have to guess the encoding you want.

    It's better to use unicode string objects everywhere. So your best option is to update your libraries so they can deal with unicode objects.

    But if you really want bytestrings, just encode the results to the encoding of your choice:

    >>> nl = json.loads(js)
    >>> nl
    [u'a', u'b']
    >>> nl = [s.encode('utf-8') for s in nl]
    >>> nl
    ['a', 'b']
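
    The same trick works for a flat dict parsed from JSON, encoding keys and values yourself (Python 2.7 sketch; nested structures would need a recursive walk like the ones elsewhere in this thread):

        >>> d = json.loads('{"a": "b"}')
        >>> dict((k.encode('utf-8'), v.encode('utf-8')) for k, v in d.items())
        {'a': 'b'}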
    
  • 2020-11-22 15:12

    With Python 3.6, sometimes I still run into this problem. For example, when getting a response from a REST API and loading the response text into JSON, I still get unicode strings. I found a simple solution using json.dumps().

    response_message = json.loads(json.dumps(response.text))
    print(response_message)
    
  • 2020-11-22 15:15

    I had a JSON dict as a string. The keys and values were unicode objects like in the following example:

    myStringDict = "{u'key':u'value'}"
    

    I could use the byteify function suggested above by first converting the string to a dict object with ast.literal_eval(myStringDict).
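
    A sketch of that two-step approach (Python 2; the byteify helper referenced above comes from another answer and isn't reproduced here):

        import ast

        myStringDict = "{u'key':u'value'}"

        # literal_eval safely parses the Python-literal syntax into a real dict;
        # keys and values are still unicode at this point and would then be
        # passed through the byteify function mentioned above.
        parsed = ast.literal_eval(myStringDict)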

  • 2020-11-22 15:17

    I'm afraid there's no way to achieve this automatically within the simplejson library.

    The scanner and decoder in simplejson are designed to produce unicode text. To do this, the library uses a function called c_scanstring (if it's available, for speed), or py_scanstring if the C version is not available. The scanstring function is called several times by nearly every routine that simplejson has for decoding a structure that might contain text. You'd have to either monkeypatch the scanstring value in simplejson.decoder, or subclass JSONDecoder and provide pretty much your own entire implementation of anything that might contain text.

    The reason that simplejson outputs unicode, however, is that the json spec specifically mentions that "A string is a collection of zero or more Unicode characters"... support for unicode is assumed as part of the format itself. Simplejson's scanstring implementation goes so far as to scan and interpret unicode escapes (even error-checking for malformed multi-byte charset representations), so the only way it can reliably return the value to you is as unicode.

    If you have an aged library that needs an str, I recommend you either laboriously search the nested data structure after parsing (which I acknowledge is what you explicitly said you wanted to avoid... sorry), or perhaps wrap your libraries in some sort of facade where you can massage the input parameters at a more granular level. The second approach might be more manageable than the first if your data structures are indeed deeply nested.
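
    For reference, such a post-parse walk could look something like this (a minimal sketch, assuming UTF-8 is the byte encoding you want; the helper name to_str is illustrative; Python 2.7):

        def to_str(data):
            # recursively replace every unicode string in the parsed structure
            if isinstance(data, unicode):
                return data.encode('utf-8')
            if isinstance(data, list):
                return [to_str(item) for item in data]
            if isinstance(data, dict):
                return {to_str(k): to_str(v) for k, v in data.iteritems()}
            return data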

  • 2020-11-22 15:17

    As Mark (Amery) correctly notes: using PyYaml's deserializer on a json dump only works if you have ASCII-only data, at least out of the box.

    Two quick comments on the PyYaml approach:

    1. NEVER use yaml.load on data from the field. It's a feature(!) of yaml to execute arbitrary code hidden within the structure.

    2. You can also make it work for non-ASCII data via this (a note on combining it with safe_load follows the snippet):

      def to_utf8(loader, node):
          return loader.construct_scalar(node).encode('utf-8')
      yaml.add_constructor(u'tag:yaml.org,2002:str', to_utf8)
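
    If you want point 2 to coexist with point 1, PyYaml's add_constructor also accepts a Loader argument, so the constructor can be registered on the safe loader instead (a sketch, assuming a reasonably recent PyYaml):

        import yaml

        # register the to_utf8 constructor from above on SafeLoader,
        # so that yaml.safe_load picks it up as well
        yaml.add_constructor(u'tag:yaml.org,2002:str', to_utf8, Loader=yaml.SafeLoader)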
      

    But performance-wise, it's no match for Mark Amery's answer:

    Throwing some deeply nested sample dicts onto the two methods, I get this (with dt[j] = time delta of json.loads(json.dumps(m))):

         dt[yaml.safe_load(json.dumps(m))] =~ 100 * dt[j]
         dt[byteify recursion(Mark Amery)] =~   5 * dt[j]
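
    A sketch of how such a comparison could be set up (m is a sample nested dict and byteify is Mark Amery's helper; both are assumed here, not defined):

        import json, timeit, yaml

        dt_j = timeit.timeit(lambda: json.loads(json.dumps(m)), number=100)
        dt_yaml = timeit.timeit(lambda: yaml.safe_load(json.dumps(m)), number=100)
        dt_byteify = timeit.timeit(lambda: byteify(json.loads(json.dumps(m))), number=100)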
    

    So deserialization, including fully walking the tree and encoding, stays well within an order of magnitude of json's C-based implementation. I find this remarkably fast, and it is also more robust than the yaml load on deeply nested structures, and less prone to security errors, given yaml.load.

    => While I would appreciate a pointer to a purely C-based converter, the byteify function should be the default answer.

    This holds especially true if your json structure comes from the field and contains user input, because then you probably need to walk over your structure anyway, independent of your desired internal data structures ('unicode sandwich' or byte strings only).

    Why?

    Unicode normalisation. For the unaware: Take a painkiller and read this.

    So using the byteify recursion you kill two birds with one stone:

    1. get your bytestrings from nested json dumps
    2. get user input values normalised, so that you find the stuff in your storage.

    In my tests it turned out that replacing input.encode('utf-8') with unicodedata.normalize('NFC', input).encode('utf-8') was even faster than without NFC, but that's heavily dependent on the sample data, I guess.
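
    A sketch of what that variant might look like (the function name is illustrative, not Mark Amery's original code; Python 2.7):

        import unicodedata

        def byteify_nfc(data):
            # NFC-normalise while converting unicode back to UTF-8 byte strings
            if isinstance(data, unicode):
                return unicodedata.normalize('NFC', data).encode('utf-8')
            if isinstance(data, list):
                return [byteify_nfc(item) for item in data]
            if isinstance(data, dict):
                return {byteify_nfc(k): byteify_nfc(v) for k, v in data.iteritems()}
            return data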
