I have a very big file (4 GB) and when I try to read it my computer hangs. So I want to read it piece by piece, and after processing each piece store the processed piece into another file before reading the next piece. Is there a lazy method to yield these pieces?
To write a lazy function, just use yield:
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

with open('really_big_file.dat') as f:
    for piece in read_in_chunks(f):
        process_data(piece)
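The question mentions writing each processed piece to another file. A minimal sketch of that workflow, assuming a binary file, a hypothetical output name processed.dat, and a placeholder process_data that just transforms the bytes:

def process_data(piece):
    # placeholder transformation; replace with the real processing
    return piece.upper()

with open('really_big_file.dat', 'rb') as f_in, open('processed.dat', 'wb') as f_out:
    for piece in read_in_chunks(f_in, chunk_size=4096):
        f_out.write(process_data(piece))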
Another option would be to use iter and a helper function:
f = open('really_big_file.dat')
def read1k():
    return f.read(1024)

for piece in iter(read1k, ''):
    process_data(piece)
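If the file were opened in binary mode, read() returns bytes and the sentinel would have to be b'' instead of ''; functools.partial also avoids the throwaway helper. A sketch, using the same placeholder file name:

from functools import partial

with open('really_big_file.dat', 'rb') as f:
    for piece in iter(partial(f.read, 1024), b''):
        process_data(piece)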
If the file is line-based, the file object is already a lazy generator of lines:
for line in open('really_big_file.dat'):
    process_data(line)
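Wrapping that in a with block makes sure the file is closed when the loop finishes (same placeholder file name):

with open('really_big_file.dat') as f:
    for line in f:
        process_data(line)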
I'm in a somewhat similar situation. It's not clear whether you know the chunk size in bytes; I usually don't, but the number of records (lines) required is known:
def get_line():
    with open('4gb_file') as file:
        for i in file:
            yield i

lines_required = 100
gen = get_line()
chunk = [i for i, j in zip(gen, range(lines_required))]
Update: Thanks nosklo. Here's what I meant. It almost works, except that it loses a line 'between' chunks.
chunk = [next(gen) for i in range(lines_required)]
This does the trick without losing any lines, but it doesn't look very nice.
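For what it's worth, itertools.islice expresses the same idea more cleanly; this is a sketch under the same assumptions (100 lines per chunk, the hypothetical file name from above), not part of the original answer:

from itertools import islice

def get_chunks(file_name, lines_required):
    with open(file_name) as f:
        while True:
            chunk = list(islice(f, lines_required))
            if not chunk:
                break
            yield chunk

for chunk in get_chunks('4gb_file', 100):
    process_data(chunk)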
If your computer, OS, and Python are 64-bit, then you can use the mmap module to map the contents of the file into memory and access it with indices and slices. Here is an example from the documentation:
import mmap

with open("hello.txt", "r+b") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())  # prints b'Hello Python!\n'
    # read content via slice notation
    print(mm[:5])  # prints b'Hello'
    # update content using slice notation;
    # note that new content must have same size
    mm[6:] = b" world!\n"
    # ... and read again using standard file methods
    mm.seek(0)
    print(mm.readline())  # prints b'Hello  world!\n'
    # close the map
    mm.close()
If your computer, OS, or Python is 32-bit, then mmap-ing large files can reserve large parts of your address space and starve your program of memory.
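For read-only access you would typically map with ACCESS_READ and walk the mapping in slices; a rough sketch, assuming the placeholder file name from above and an arbitrary 64 kB chunk size:

import mmap

with open('really_big_file.dat', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for offset in range(0, len(mm), 65536):
        process_data(mm[offset:offset + 65536])
    mm.close()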
I am not allowed to comment due to my low reputation, but SilentGhost's solution should be much easier with file.readlines([sizehint])
Python file methods
Edit: SilentGhost is right, but this should be better than:
s = ""
for i in xrange(100):
s += file.next()
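For reference, a readlines(sizehint) loop would look roughly like this; the hint is in bytes and only approximate, and the file name is the usual placeholder:

with open('really_big_file.dat') as f:
    while True:
        lines = f.readlines(100000)  # read roughly 100 kB worth of lines
        if not lines:
            break
        for line in lines:
            process_data(line)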
To process line by line, this is an elegant solution:
def stream_lines(file_name):
    file = open(file_name)
    while True:
        line = file.readline()
        if not line:
            file.close()
            break
        yield line
Blank lines are not a problem here: readline() returns '\n' for a blank line and only returns an empty string at the end of the file.
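Usage is then the usual generator loop (placeholder file name again):

for line in stream_lines('really_big_file.dat'):
    process_data(line)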
f = ...  # file-like object, i.e. supporting read(size) function and
         # returning empty string '' when there is nothing to read

def chunked(file, chunk_size):
    return iter(lambda: file.read(chunk_size), '')

for data in chunked(f, 65536):
    process_data(data)  # process the data
UPDATE: The approach is best explained in https://stackoverflow.com/a/4566523/38592