Lazy Method for Reading Big File in Python?

Asked 2020-11-21 06:40

I have a very big file (4 GB), and when I try to read it my computer hangs. So I want to read it piece by piece, and after processing each piece store the processed piece into another file and read the next piece. Is there a lazy method for doing this?
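
A minimal sketch of the kind of read/process/write loop being asked about (the chunk size, the output file name, and process_data are placeholders, not part of the question):

    def process_data(piece):
        return piece  # placeholder transformation

    with open('really_big_file.dat', 'rb') as src, open('processed.dat', 'wb') as dst:
        while True:
            piece = src.read(1024 * 1024)   # read 1 MiB at a time
            if not piece:                   # b'' signals end of file
                break
            dst.write(process_data(piece))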

12 Answers
  • 2020-11-21 06:56

    To write a lazy function, just use yield:

    def read_in_chunks(file_object, chunk_size=1024):
        """Lazy function (generator) to read a file piece by piece.
        Default chunk size: 1k."""
        while True:
            data = file_object.read(chunk_size)
            if not data:
                break
            yield data
    
    
    with open('really_big_file.dat') as f:
        for piece in read_in_chunks(f):
            process_data(piece)
    

    Another option would be to use iter and a helper function:

    f = open('really_big_file.dat')
    def read1k():
        return f.read(1024)
    
    for piece in iter(read1k, ''):
        process_data(piece)
    

    If the file is line-based, the file object is already a lazy generator of lines:

    for line in open('really_big_file.dat'):
        process_data(line)
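
    If the chunks are arbitrary bytes rather than text, it may be safer to open the file in binary mode; the sentinel for iter() then has to be b'' (a small variation on the snippets above, with process_data still a placeholder):

    with open('really_big_file.dat', 'rb') as f:
        # read() returns bytes in binary mode, and b'' when the file is exhausted
        for piece in iter(lambda: f.read(1024), b''):
            process_data(piece)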
    
  • 2020-11-21 06:56

    I'm in a somewhat similar situation. It's not clear whether you know the chunk size in bytes; I usually don't, but the number of records (lines) required is known:

    def get_line():
        with open('4gb_file') as file:
            for i in file:
                yield i

    lines_required = 100
    gen = get_line()
    chunk = [i for i, j in zip(gen, range(lines_required))]
    

    Update: Thanks nosklo. The snippet above is what I meant; it almost works, except that zip pulls one extra line from the generator at each chunk boundary and loses it. This version:

    chunk = [next(gen) for i in range(lines_required)]
    

    does the trick without losing any lines, but it doesn't look very nice.
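
    A slightly tidier way to express the same idea, using itertools.islice (not part of the original answer; 100 lines per chunk is just an example value):

    from itertools import islice

    with open('4gb_file') as file:
        while True:
            chunk = list(islice(file, 100))   # take up to 100 lines
            if not chunk:                     # empty list means end of file
                break
            # ... process the chunk here ...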

  • 2020-11-21 07:00

    If your computer, OS, and Python are 64-bit, then you can use the mmap module to map the contents of the file into memory and access it with indices and slices. Here is an example from the documentation:

    import mmap
    with open("hello.txt", "r+b") as f:
        # memory-map the file, size 0 means whole file
        mm = mmap.mmap(f.fileno(), 0)
        # read content via standard file methods
        print(mm.readline())  # prints b"Hello Python!\n"
        # read content via slice notation
        print(mm[:5])  # prints b"Hello"
        # update content using slice notation;
        # note that new content must have same size
        mm[6:] = b" world!\n"
        # ... and read again using standard file methods
        mm.seek(0)
        print(mm.readline())  # prints b"Hello  world!\n"
        # close the map
        mm.close()
    

    If your computer, OS, or Python is 32-bit, then mmap-ing large files can reserve large parts of your address space and starve your program of memory.
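
    For the chunked processing described in the question, the mapped file can also be consumed slice by slice instead of all at once (a sketch; process_data and the chunk size are placeholders):

    import mmap

    with open('really_big_file.dat', 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            chunk_size = 1024 * 1024
            for offset in range(0, len(mm), chunk_size):
                process_data(mm[offset:offset + chunk_size])   # a bytes slice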

  • 2020-11-21 07:00

    I am not allowed to comment due to my low reputation, but SilentGhost's solution should be much easier with file.readlines([sizehint]).

    Python file methods

    Edit: SilentGhost is right, but this should be better than:

    s = ""
    for i in range(100):
        s += next(file)   # file is an already-open file object
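
    For reference, a minimal sketch of the readlines(sizehint) approach being suggested (the sizehint value and process_data are placeholders):

    with open('really_big_file.dat') as f:
        while True:
            lines = f.readlines(100000)   # read roughly 100 kB worth of whole lines
            if not lines:                 # empty list means end of file
                break
            for line in lines:
                process_data(line)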
    
  • 2020-11-21 07:00

    To process line by line, this is an elegant solution:

    def stream_lines(file_name):
        file = open(file_name)
        while True:
            line = file.readline()
            if not line:
                file.close()
                break
            yield line
    

    Note that blank lines are not a problem here: readline() returns '\n' for them and only returns an empty string at end of file.
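
    An equivalent formulation (not from the original answer) that lets a with statement take care of closing the file:

    def stream_lines(file_name):
        with open(file_name) as file:
            yield from file   # the file object already yields lines lazily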

  • 2020-11-21 07:04
    f = ...  # file-like object, i.e. supporting read(size) function and
             # returning empty string '' when there is nothing to read

    def chunked(file, chunk_size):
        # iter() calls the lambda until it returns the sentinel ''
        # (use b'' as the sentinel if the file is opened in binary mode)
        return iter(lambda: file.read(chunk_size), '')

    for data in chunked(f, 65536):
        process_data(data)  # placeholder for processing each chunk
    

    UPDATE: The approach is best explained in https://stackoverflow.com/a/4566523/38592
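
    The same pattern can also be written with functools.partial in place of the lambda (a minor variant, not quoted from the linked answer; process_data is a placeholder):

    from functools import partial

    with open('really_big_file.dat', 'rb') as f:
        # partial(f.read, 65536) is a zero-argument callable, like the lambda above
        for data in iter(partial(f.read, 65536), b''):
            process_data(data)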
