How to read a large file - line by line?

一整个雨季 2020-11-21 11:44

I want to iterate over each line of an entire file. One way to do this is by reading the entire file, saving it to a list, and then going over the line of interest. This method uses a lot of memory, so I am looking for an alternative.

11 Answers
  • 2020-11-21 12:10

    From the Python documentation for fileinput.input():

    This iterates over the lines of all files listed in sys.argv[1:], defaulting to sys.stdin if the list is empty.

    Further, the definition of the function is:

    fileinput.FileInput([files[, inplace[, backup[, mode[, openhook]]]]])
    

    Reading between the lines, this tells me that files can be a list, so you could have something like:

    for each_line in fileinput.input([input_file, input_file]):
      do_something(each_line)
    

    See the fileinput documentation for more information.

  • 2020-11-21 12:11

    Two memory-efficient ways, in ranked order (first is best):

    1. Use of with - supported in Python 2.5 and above
    2. Use of yield if you really want control over how much to read

    1. Use of with

    with is the nice and efficient Pythonic way to read large files. Advantages: 1) the file object is automatically closed after exiting the with block; 2) the file is still closed if an exception is raised inside the with block; 3) the for loop iterates through the file object f line by line, and internally it does buffered I/O (to optimize costly I/O operations) and memory management.

    with open("x.txt") as f:
        for line in f:
            do_something(line)  # process one line at a time
    

    2. Use of yield

    Sometimes you might want more fine-grained control over how much to read in each iteration. In that case, use iter & yield. Note that with this method you explicitly need to close the file at the end.

    def readInChunks(fileObj, chunkSize=2048):
        """
        Lazy function to read a file piece by piece.
        Default chunk size: 2kB.
    
        """
        while True:
            data = fileObj.read(chunkSize)
            if not data:
                break
            yield data
    
    f = open('bigFile')
    for chunk in readInChunks(f):
        do_something(chunk)
    f.close()
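
    If you'd rather not manage the close yourself, the same generator can also be driven from a with block; a minimal sketch (do_something is still a placeholder for your own processing):

    with open('bigFile') as f:
        # the with block closes the file automatically, even if
        # do_something() raises an exception mid-iteration
        for chunk in readInChunks(f):
            do_something(chunk)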
    

    Pitfalls, and for the sake of completeness - the methods below are not as good or as elegant for reading large files, but please read them to get a rounded understanding.

    In Python, the most common way to read lines from a file is to do the following:

    for line in open('myfile','r').readlines():
        do_something(line)
    

    When this is done, however, the readlines() function (the same applies to the read() function) loads the entire file into memory and then iterates over it. A slightly better approach for large files (though the first two methods above are still the best) is to use the fileinput module, as follows:

    import fileinput
    
    for line in fileinput.input(['myfile']):
        do_something(line)
    

    The fileinput.input() call reads lines sequentially, but doesn't keep them in memory after they've been read; you could even simply iterate over the file object directly, since files in Python are iterable.

    References

    1. Python with statement
  • 2020-11-21 12:11

    I would strongly recommend not using the default file loading as it is horrendously slow. You should look into the numpy functions and the IOpro functions (e.g. numpy.loadtxt()).

    http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html

    https://store.continuum.io/cshop/iopro/
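
    For example, a minimal sketch of what the optimized loaders look like in use ('myfile.txt' is just a placeholder for a whitespace-delimited numeric file):

    import numpy as np

    # one optimized call loads the whole numeric file into a 2-D array,
    # which is much faster than a pure-Python line-by-line loop
    data = np.loadtxt('myfile.txt')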

    Then you can break your pairwise operation into chunks:

    import numpy as np
    import math

    lines_total = n                          # n = total number of lines in the file
    similarity = np.zeros((n, n))            # shape must be passed as a tuple
    lines_per_chunk = m
    n_chunks = int(math.ceil(float(n) / m))  # math.ceil returns a float in Python 2
    for i in xrange(n_chunks):
        for j in xrange(n_chunks):
            # read_lines() is a placeholder for a function of your choice that
            # reads lines i*lines_per_chunk to (i+1)*lines_per_chunk, and so on
            chunk_i = read_lines(i * lines_per_chunk, (i + 1) * lines_per_chunk)
            chunk_j = read_lines(j * lines_per_chunk, (j + 1) * lines_per_chunk)
            similarity[i*lines_per_chunk:(i+1)*lines_per_chunk,
                       j*lines_per_chunk:(j+1)*lines_per_chunk] = fast_operation(chunk_i, chunk_j)
    

    It's almost always much faster to load data in chunks and then do matrix operations on it than to process it element by element!

  • 2020-11-21 12:12

    The correct, fully Pythonic way to read a file is the following:

    with open(...) as f:
        for line in f:
            # Do something with 'line'
    

    The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered I/O and memory management so you don't have to worry about large files.
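
    Roughly speaking, the with form above behaves like writing the try/finally boilerplate yourself (a sketch, with 'myfile' as a placeholder path):

    f = open('myfile')
    try:
        for line in f:
            # Do something with 'line'
            pass
    finally:
        f.close()  # runs even if the loop body raises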

    There should be one -- and preferably only one -- obvious way to do it.

  • 2020-11-21 12:15

    To strip newlines:

    with open(file_path, 'rU') as f:
        for line_terminated in f:
            line = line_terminated.rstrip('\n')
            ...
    

    With universal newline support, all text file lines will seem to be terminated with '\n', whatever the terminators in the file: '\r', '\n', or '\r\n'.

    EDIT - To specify universal newline support:

    • Python 2 on Unix - open(file_path, mode='rU') - required [thanks @Dave]
    • Python 2 on Windows - open(file_path, mode='rU') - optional
    • Python 3 - open(file_path, newline=None) - optional

    The newline parameter is only supported in Python 3 and defaults to None. The mode parameter defaults to 'r' in all cases. The 'U' mode character is deprecated in Python 3. In Python 2 on Windows some other mechanism appears to translate \r\n to \n.
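
    In Python 3 the same newline-stripping loop can therefore be written without 'U' (a sketch, assuming file_path names a text file as above):

    with open(file_path, newline=None) as f:  # newline=None is the default
        for line_terminated in f:
            line = line_terminated.rstrip('\n')
            ...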

    Docs:

    • open() for Python 2
    • open() for Python 3

    To preserve native line terminators:

    with open(file_path, 'rb') as f:
        for line_native_terminated in f:
            ...
    

    Binary mode still splits the file into lines when you iterate over it, and each line will have whatever terminators it has in the file.
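
    For example (a sketch, assuming the file was written with Windows '\r\n' endings), iterating in binary mode yields the raw lines unchanged:

    with open(file_path, 'rb') as f:
        for raw_line in f:
            # each line is a raw bytes string, e.g. b'first line\r\n'
            print(repr(raw_line))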

    Thanks to @katrielalex's answer, Python's open() doc, and IPython experiments.
