Retrieving JSON objects from a text file (using Python)

失恋的感觉 2020-11-30 04:50

I have thousands of text files containing multiple JSON objects, but unfortunately there is no delimiter between the objects. Objects are stored as dictionaries and some of

9 Answers
  • 2020-11-30 05:24

    How about something like this:

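    import re
    import json
    
    # Read the whole file; it holds several JSON objects with no delimiter.
    with open('test.json') as f:
        jsonstr = f.read()
    
    # Put each object on its own line by breaking at every "}{" boundary.
    # Caveat: this also breaks at a "}{" inside a string value, and it
    # assumes that no single object spans more than one line.
    jsonstr = re.sub(r'}\s*{', '}\n{', jsonstr)
    
    for line in jsonstr.split('\n'):
        if line.strip():  # skip blank lines, e.g. from a trailing newline
            jsonobj = json.loads(line)
            print(json.dumps(jsonobj))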
    
  • 2020-11-30 05:31
    import json
    
    # Assumes each line of the file holds exactly one complete JSON object.
    with open('filepath', 'r') as file1:
        data = file1.readlines()
    
    for line in data:
        values = json.loads(line)
        # Now you can access the object's fields using values.get('key')
    
  • 2020-11-30 05:41

    Solution

    As far as I know, }{ does not appear in valid JSON outside of string values, so the following should be safe for splitting concatenated objects into separate strings, as long as no string value in your data contains }{ (txt is the content of your file). It does not require any import (not even the re module):

    # Drop only the outermost braces, then split at each object boundary.
    retrieved_strings = list(map(lambda x: '{' + x + '}', txt.strip()[1:-1].split('}{')))
    

    or if you prefer list comprehensions (as David Zwicker mentioned in the comments), you can write it like this:

    retrieved_strings = ['{' + x + '}' for x in txt.strip()[1:-1].split('}{')]
    

    The result, retrieved_strings, is a list of strings, each containing a separate JSON object. See proof here: http://ideone.com/Purpb

    Example

    The following string:

    '{"field1":"a","field2":"b"}{"field1":"c","field2":"d"}{"field1":"e","field2":"f"}'
    

    will be turned into:

    ['{"field1":"a","field2":"b"}', '{"field1":"c","field2":"d"}', '{"field1":"e","field2":"f"}']
    

    as proven in the example I mentioned.
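
To get dictionaries rather than strings, each piece can then be handed to json.loads. The following is only a minimal sketch with made-up data; it uses quoted keys so that each piece is valid JSON, and it assumes no string value contains }{.

```python
import json

# Hypothetical sample data: two flat objects concatenated without a delimiter.
txt = '{"field1":"a","field2":"b"}{"field1":"c","field2":"d"}'

# Drop only the outermost braces, split at each "}{" boundary,
# then restore the braces on every piece.
pieces = ['{' + x + '}' for x in txt.strip()[1:-1].split('}{')]

# Parse each piece into a Python dict.
objs = [json.loads(piece) for piece in pieces]
print(objs)
```

If any piece fails to parse, that is a hint that a }{ inside a string value was split by mistake.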

  • 2020-11-30 05:45

    Suppose you added a [ to the start of the text in the file, and used a version of json.load() that, on finding a { where it expected a comma (or on hitting the end of the file), emitted the just-completed object?
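
One way to approximate this idea without patching json.load() is the standard library's JSONDecoder.raw_decode, which parses one value and returns it together with the index where parsing stopped. The sketch below is an assumption about how that could look, not a modified json.load(); the function name iter_json_objects is hypothetical.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value found back to back in `text`."""
    decoder = json.JSONDecoder()
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so skip it by hand.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # Parse one value starting at pos; pos moves to just past that value.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical sample: nested objects and a "}{" inside a string are handled.
sample = '{"a": 1}{"b": {"c": 2}}\n{"d": "}{"}'
print(list(iter_json_objects(sample)))
```

Because the parser itself decides where each object ends, nested objects and braces inside strings need no special handling.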

  • 2020-11-30 05:46

    Sebastian Blask has the right idea, but there's no reason to use regexes for such a simple change.

    import json
    
    objs = json.loads("[%s]" % open('your_file.name').read().replace('}{', '},{'))
    

    Or, more legibly:

    import json
    
    with open('your_file.name') as f:
        raw_objs_string = f.read()  # read in raw data
    raw_objs_string = raw_objs_string.replace('}{', '},{')  # insert a comma between each object
    objs_string = '[%s]' % raw_objs_string  # wrap in a list, to make valid JSON
    objs = json.loads(objs_string)  # parse JSON
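
One caveat that may be worth checking against your data: a plain replace also rewrites any }{ that sits inside a string value, and the result can still parse, so the change goes unnoticed. A small sketch with made-up data:

```python
import json

# Hypothetical input: the first object's string value happens to contain "}{".
raw = '{"note": "a}{b"}{"note": "c"}'

# The replace-and-wrap approach still yields valid JSON here...
naive = json.loads('[%s]' % raw.replace('}{', '},{'))

# ...but the string value has been silently altered by the inserted comma.
print(naive[0]["note"])
```

If your string values never contain }{, the replace approach is unaffected by this.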
    
  • 2020-11-30 05:47

    This decodes your "list" of JSON Objects from a string:

    from json import JSONDecoder
    
    def loads_invalid_obj_list(s):
        decoder = JSONDecoder()
        s_len = len(s)
    
        objs = []
        end = 0
        while end != s_len:
            obj, end = decoder.raw_decode(s, idx=end)
            objs.append(obj)
    
        return objs
    

    The bonus here is that you work with the parser rather than around it, so it still tells you exactly where it found an error.

    Examples

    >>> loads_invalid_obj_list('{}{}')
    [{}, {}]
    
    >>> loads_invalid_obj_list('{}{\n}{')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "decode.py", line 9, in loads_invalid_obj_list
        obj, end = decoder.raw_decode(s, idx=end)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 376, in raw_decode
        obj, end = self.scan_once(s, idx)
    ValueError: Expecting object: line 2 column 2 (char 5)
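
On Python 3 the same failure surfaces as json.JSONDecodeError, a ValueError subclass that exposes the position as attributes (lineno, colno, pos). The sketch below assumes Python 3.5 or later; the function name loads_with_error_position is hypothetical, and the exact message and column may differ from the Python 2.7 traceback above.

```python
from json import JSONDecoder, JSONDecodeError


def loads_with_error_position(s):
    """Decode back-to-back JSON values.

    Returns (objects, None) on success, or (objects_so_far,
    (lineno, colno, pos)) at the first parse error.
    """
    decoder = JSONDecoder()
    objs = []
    end = 0
    while end != len(s):
        try:
            obj, end = decoder.raw_decode(s, idx=end)
        except JSONDecodeError as err:
            # Keep what parsed cleanly and report where the parser gave up.
            return objs, (err.lineno, err.colno, err.pos)
        objs.append(obj)
    return objs, None


objs, error_at = loads_with_error_position('{}{\n}{')
print(objs, error_at)
```

Returning the partial result alongside the position makes it easier to locate a damaged file among thousands without losing the objects that did parse.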
    

    Clean Solution (added later)

    import json
    import re
    
    #shameless copy paste from json/decoder.py
    FLAGS = re.VERBOSE | re.MULTILINE | re.DOTALL
    WHITESPACE = re.compile(r'[ \t\n\r]*', FLAGS)
    
    class ConcatJSONDecoder(json.JSONDecoder):
        def decode(self, s, _w=WHITESPACE.match):
            s_len = len(s)
    
            objs = []
            end = 0
            while end != s_len:
                obj, end = self.raw_decode(s, idx=_w(s, end).end())
                end = _w(s, end).end()
                objs.append(obj)
            return objs
    

    Examples

    >>> print(json.loads('{}', cls=ConcatJSONDecoder))
    [{}]
    
    >>> print(json.load(open('file'), cls=ConcatJSONDecoder))
    [{}]
    
    >>> print(json.loads('{}{} {', cls=ConcatJSONDecoder))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 339, in loads
        return cls(encoding=encoding, **kw).decode(s)
      File "decode.py", line 15, in decode
        obj, end = self.raw_decode(s, idx=_w(s, end).end())
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 376, in raw_decode
        obj, end = self.scan_once(s, idx)
    ValueError: Expecting object: line 1 column 5 (char 5)
    