I want to iterate over each line of an entire file. One way to do this is by reading the entire file, saving it to a list, then going over the lines of interest. This method uses a lot of memory, so I am looking for an alternative.
Two memory-efficient ways, in ranked order (first is best):
1) with: supported from Python 2.5 and above
2) yield: if you really want control over how much to read

with
The with statement is the nice, efficient, Pythonic way to read large files. Advantages: 1) the file object is closed automatically on exit from the with block; 2) the file is closed even if an exception is raised inside the with block; 3) memory: the for loop iterates through the file object f line by line, so only the current line (plus a read buffer) is held at a time. Internally it does buffered I/O (to cut down on costly I/O operations) and memory management.
    with open("x.txt") as f:
        for line in f:
            # do something with the current line
            do_something(line)
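As a self-contained sketch of the pattern above, the snippet below writes a small temporary file and then counts its lines lazily. The helper name count_lines and the sample data are made up for illustration; they are not part of the original answer.

```python
import os
import tempfile


def count_lines(path):
    """Count lines without loading the whole file into memory."""
    count = 0
    with open(path) as f:
        for _ in f:  # the file object yields one line per iteration
            count += 1
    # Leaving the with block has already closed the file.
    return count, f.closed


# Demo on a throwaway file (hypothetical sample data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\nthird\n")
    sample_path = tmp.name

line_count, was_closed = count_lines(sample_path)
print(line_count, was_closed)  # 3 True
os.remove(sample_path)
```

The second value confirms that the file was closed without an explicit close() call.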
yield
Sometimes one might want more fine-grained control over how much to read in each iteration. In that case, iterate over a generator built with yield. Note that with this method one needs to close the file explicitly at the end.
    def readInChunks(fileObj, chunkSize=2048):
        """
        Lazy function to read a file piece by piece.
        Default chunk size: 2 kB.
        """
        while True:
            data = fileObj.read(chunkSize)
            if not data:
                break
            yield data

    f = open('bigFile')
    for chunk in readInChunks(f):
        do_something(chunk)
    f.close()
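One possible variant, offered as a sketch rather than as the method above: the two-argument form iter(callable, sentinel) can replace the while loop, and opening the file inside the generator lets a with block take care of closing it. The name read_in_chunks, the binary mode, and the 5000-byte sample file are assumptions made for the demo.

```python
import os
import tempfile
from functools import partial


def read_in_chunks(path, chunk_size=2048):
    """Yield chunks of at most chunk_size bytes from the file at path."""
    with open(path, "rb") as f:
        # iter(callable, sentinel) keeps calling f.read(chunk_size)
        # until it returns b"", which signals end of file.
        for chunk in iter(partial(f.read, chunk_size), b""):
            yield chunk


# Demo on a throwaway 5000-byte file (hypothetical sample data).
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"x" * 5000)
    sample_path = tmp.name

chunk_sizes = [len(chunk) for chunk in read_in_chunks(sample_path)]
print(chunk_sizes)  # [2048, 2048, 904]
os.remove(sample_path)
```

With this shape the file is closed once the generator is exhausted. If the caller stops iterating early, the file stays open until the generator is closed or garbage-collected, so the explicit version above may still be preferable when early exit is common.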
Pitfalls, for the sake of completeness: the methods below are not as good or as elegant for reading large files, but please read on to get a rounded understanding.
In Python, the most common way to read lines from a file is to do the following:
    for line in open('myfile', 'r').readlines():
        do_something(line)
When this is done, however, the readlines() function (the same applies to the read() function) loads the entire file into memory, then iterates over it. A slightly better approach for large files (the first two methods mentioned above are still the best) is to use the fileinput module, as follows:
    import fileinput

    for line in fileinput.input(['myfile']):
        do_something(line)
The fileinput.input() call reads lines sequentially and doesn't keep them in memory after they've been read. Or, even more simply, iterate over the file object directly, since a file in Python is iterable.
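To make the contrast concrete, here is a small sketch that runs all three approaches on the same throwaway file. The sample data and variable names are hypothetical. Only readlines() builds a full list; the other two count lines as they stream past.

```python
import fileinput
import os
import tempfile

# Throwaway three-line file (hypothetical sample data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a\nb\nc\n")
    sample_path = tmp.name

# readlines() builds the complete list of lines up front.
with open(sample_path) as f:
    eager_count = len(f.readlines())

# Iterating over the file object yields one line at a time.
with open(sample_path) as f:
    lazy_count = sum(1 for _ in f)

# fileinput.input() also yields lines sequentially; it can be used
# as a context manager in Python 3.2 and later.
with fileinput.input([sample_path]) as stream:
    fileinput_count = sum(1 for _ in stream)

print(eager_count, lazy_count, fileinput_count)  # 3 3 3
os.remove(sample_path)
```

All three see the same lines; the difference is only in how much of the file is held in memory at once.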