How to incrementally write into a JSON file

我在风中等你 2021-02-07 20:05

I am writing a program that requires me to generate a very large JSON file. I know the traditional way is to dump a list of dictionaries using json.dump()

3 answers
  •  旧时难觅i
    2021-02-07 20:54

    I know this is a year late, but the issue is still open and I'm surprised that json.JSONEncoder.iterencode() was not mentioned.

    The potential problem with iterencode in this example is that you would want an iterative handle on the large data set by using a generator, and the JSON encoder does not serialize generators.
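To see the limitation concretely, here is a minimal sketch, assuming Python 3. The generator name `numbers` is made up for illustration. It shows the standard encoder rejecting a generator object:

```python
import json


def numbers():
    # Hypothetical generator standing in for a large data source.
    for i in range(3):
        yield {"hello_world": i}


try:
    json.dumps(numbers())
    error_message = None
except TypeError as exc:
    # The default encoder has no rule for generator objects.
    error_message = str(exc)

print(error_message)
```

The same TypeError would be expected from json.dump() and from iterencode(), since they share the encoder's type dispatch.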

    The way around this is to subclass the list type and override the __iter__ magic method so that it yields the output of your generator.

    Here is an example of this list subclass.

    class StreamArray(list):
        """
        Wraps a generator in a list subclass so that it is JSON serialisable
        while still retaining the iterative nature of a generator.
    
        I.e. it behaves like a list without having to exhaust the generator
        and keep its contents in memory.
        """
        def __init__(self, generator):
            self.generator = generator
            # Start non-zero so the encoder does not treat the list as empty.
            self._len = 1
    
        def __iter__(self):
            self._len = 0
            for item in self.generator:
                yield item
                self._len += 1
    
        def __len__(self):
            """
            The JSON encoder checks the length to decide whether there is
            anything to encode.
            """
            return self._len
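Why `_len` starts at 1 may not be obvious. In CPython's pure-Python encoder, a list that reports itself as empty appears to be written as `[]` without ever being iterated. That is an implementation detail rather than documented behaviour. A small sketch, assuming Python 3, with `NoLenStream` as a made-up counter-example class that omits the `__len__` override:

```python
import json


class NoLenStream(list):
    """Hypothetical variant WITHOUT the __len__ override, for comparison."""

    def __init__(self, generator):
        # The underlying list stays empty; items live only in the generator.
        self.generator = generator

    def __iter__(self):
        yield from self.generator


# The encoder sees an empty list and never calls __iter__.
chunks = list(json.JSONEncoder().iterencode(NoLenStream(iter([1, 2]))))
print(chunks)
```

Here the generator's items never reach the output, which is why the class above reports a non-zero length before iteration starts.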
    

    The usage from here on is quite simple. Get the generator handle, pass it into the StreamArray class, pass the StreamArray object into iterencode(), and iterate over the chunks. The chunks are JSON-formatted output that can be written directly to a file.

    Example usage:

    import json
    
    # Function that will iteratively generate a large set of data.
    def large_list_generator_func():
        for i in range(5):
            chunk = {'hello_world': i}
            print('Yielding chunk: ', chunk)
            yield chunk
    
    # Write the contents to file:
    with open('/tmp/streamed_write.json', 'w') as outfile:
        large_generator_handle = large_list_generator_func()
        stream_array = StreamArray(large_generator_handle)
        for chunk in json.JSONEncoder().iterencode(stream_array):
            print('Writing chunk: ', chunk)
            outfile.write(chunk)
    

    The output shows that yields and writes are interleaved.

    Yielding chunk:  {'hello_world': 0}
    Writing chunk:  [
    Writing chunk:  {
    Writing chunk:  "hello_world"
    Writing chunk:  : 
    Writing chunk:  0
    Writing chunk:  }
    Yielding chunk:  {'hello_world': 1}
    Writing chunk:  , 
    Writing chunk:  {
    Writing chunk:  "hello_world"
    Writing chunk:  : 
    Writing chunk:  1
    Writing chunk:  }
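To check that the streamed file is a complete, valid JSON array, one option is to read it back with json.load(). The following is a self-contained sketch, assuming Python 3. It repeats the StreamArray class in compact form, uses a hypothetical generator `rows`, and writes to a temporary file chosen by tempfile rather than the /tmp path used above:

```python
import json
import os
import tempfile


class StreamArray(list):
    """List subclass that streams items from a generator (as in the answer)."""

    def __init__(self, generator):
        self.generator = generator
        self._len = 1  # non-zero so the encoder does not treat it as empty

    def __iter__(self):
        self._len = 0
        for item in self.generator:
            yield item
            self._len += 1

    def __len__(self):
        return self._len


def rows(n):
    # Hypothetical data source; replace with the real generator.
    for i in range(n):
        yield {"hello_world": i}


# Stream the encoded chunks to a temporary file.
fd, path = tempfile.mkstemp(suffix=".json")
with os.fdopen(fd, "w") as outfile:
    for chunk in json.JSONEncoder().iterencode(StreamArray(rows(5))):
        outfile.write(chunk)

# Read the file back to confirm it parses as one JSON array.
with open(path) as infile:
    loaded = json.load(infile)
os.remove(path)

print(loaded)
```

One caveat worth testing before relying on this: with a generator that yields nothing, the hard-coded initial length of 1 may cause the encoder to emit an unbalanced `]`, so an empty data set probably needs separate handling.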
    
