Question
I would like to know whether there is a method that can read multiple lines from a file, batch by batch. For example:
with open(filename, 'rb') as f:
    for n_lines in f:
        process(n_lines)
What I would like to do is: on every iteration, read the next n lines from the file, batch by batch. A single file is too big to read at once, so I want to read it part by part.
Answer 1:
itertools.islice and the two-argument form of iter can be used to accomplish this, but it's a little funny:
from itertools import islice

n = 5  # Or whatever chunk size you want
with open(filename, 'rb') as f:
    for n_lines in iter(lambda: tuple(islice(f, n)), ()):
        process(n_lines)
This will keep islice-ing off n lines at a time (using tuple to actually force the whole chunk to be read in) until f is exhausted, at which point it stops. The final chunk will have fewer than n lines if the number of lines in the file isn't an even multiple of n. If you want all the lines in a chunk to be a single string, change the for loop to:
# The b prefixes are ignored on 2.7, and necessary on 3.x since you opened
# the file in binary mode
for n_lines in iter(lambda: b''.join(islice(f, n)), b''):
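To show the islice/iter pattern concretely, here is a minimal runnable sketch. It substitutes an in-memory io.BytesIO for the real file (which behaves like a file opened in 'rb' mode); the data and chunk size are made up for illustration:

```python
import io
from itertools import islice

# Stand-in for a real file opened in binary mode.
data = io.BytesIO(b"a\nb\nc\nd\ne\nf\ng\n")
n = 3

batches = []
# tuple(islice(data, n)) reads up to n lines; iter(..., ()) stops
# when an empty tuple comes back, i.e. at end of file.
for n_lines in iter(lambda: tuple(islice(data, n)), ()):
    batches.append(n_lines)

# Seven lines in batches of 3 -> batch sizes 3, 3, 1.
print([len(b) for b in batches])  # -> [3, 3, 1]
```

Note that the final batch has only one line, since 7 is not an even multiple of 3, exactly as described above.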
Another approach is to use izip_longest for the purpose, which avoids lambda functions:
from future_builtins import map # Only on Py2
from itertools import izip_longest # zip_longest on Py3
# gets tuples possibly padded with empty strings at end of file
for n_lines in izip_longest(*[f]*n, fillvalue=b''):
# Or to combine into a single string:
for n_lines in map(b''.join, izip_longest(*[f]*n, fillvalue=b'')):
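A small runnable sketch of the zip_longest approach on Python 3 (again with io.BytesIO standing in for the real file, and made-up data):

```python
import io
from itertools import zip_longest  # izip_longest on Py2

data = io.BytesIO(b"1\n2\n3\n4\n5\n")
n = 2

# [data] * n is n references to the SAME iterator, so zip_longest
# pulls n consecutive lines into each tuple, padding the last tuple
# with the fillvalue b'' when the line count isn't a multiple of n.
chunks = list(zip_longest(*[data] * n, fillvalue=b""))
print(chunks)  # -> [(b'1\n', b'2\n'), (b'3\n', b'4\n'), (b'5\n', b'')]
```

The trailing b'' padding is the trade-off of this approach: the final tuple is always exactly n items long, so you may need to filter out the fill values.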
Answer 2:
You can actually just iterate over the lines in a file (see the file.next docs; this also works on Python 3) like:
with open(filename) as f:
    for line in f:
        something(line)
so your code can be rewritten to
n = 5  # your batch size
with open(filename) as f:
    batch = []
    for line in f:
        batch.append(line)
        if len(batch) == n:
            process(batch)
            batch = []
    process(batch)  # this final batch might be smaller or even empty
but normally just processing line by line is more convenient (first example).
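The accumulate-and-flush loop above can also be packaged as a generator, which keeps the batching logic separate from the processing. This is a sketch with a hypothetical helper name (batched_lines) and io.StringIO standing in for the file:

```python
import io

def batched_lines(f, n):
    """Yield lists of up to n lines from file object f (hypothetical helper)."""
    batch = []
    for line in f:
        batch.append(line)
        if len(batch) == n:
            yield batch
            batch = []
    if batch:  # final, possibly short batch; skipped entirely if empty
        yield batch

f = io.StringIO("a\nb\nc\nd\ne\n")
sizes = [len(b) for b in batched_lines(f, 2)]
print(sizes)  # -> [2, 2, 1]
```

Unlike the inline loop above, the `if batch:` guard means process() is never handed an empty list.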
If you don't care exactly how many lines each batch contains, but only that it doesn't use too much memory, then use file.readlines with a sizehint, like:
size_hint = 1 << 24  # 16 MiB
with open(filename) as f:
    while True:
        lines = f.readlines(size_hint)
        if not lines:  # readlines returns an empty list at EOF
            break
        process(lines)
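A runnable sketch of the readlines-with-hint loop, using io.StringIO and a deliberately tiny hint so the effect is visible (in practice the hint would be something like 1 << 24):

```python
import io

# Three long lines of ~100 characters each.
data = io.StringIO("x" * 100 + "\n" + "y" * 100 + "\n" + "z" * 100 + "\n")
size_hint = 64  # tiny hint for demonstration only

total_lines = 0
while True:
    lines = data.readlines(size_hint)
    if not lines:  # empty list signals EOF and ends the loop
        break
    total_lines += len(lines)

print(total_lines)  # -> 3
```

With a 64-character hint, each call stops after a single ~100-character line, so the loop makes three passes before hitting EOF. The hint is a lower bound of sorts: readlines stops adding lines once the total size exceeds it, so individual lines longer than the hint are still returned whole.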
Source: https://stackoverflow.com/questions/39549426/read-multiple-lines-from-a-file-batch-by-batch