From the question: "I have a 100 GB text file, which is a BCP dump from a database. When I try to import it with BULK INSERT, I get a cryptic error on line number 219506324. Before sol…"
Well, memory can run out at any time, asynchronously and unpredictably -- that's why the "never an exception" promise doesn't really apply there (just as in Java, where every method must declare the checked exceptions it can throw, unchecked exceptions are exempted from that rule, since just about any method can raise them at any time due to resource scarcity or other system-wide issues).
linecache tries to read the whole file. Your only simple alternative (hopefully you're not in a hurry) is to read one line at a time from the start...:
def readoneline(filepath, linenum):
    # linenum is 0-based; returns '' for out-of-range line numbers
    if linenum < 0: return ''
    with open(filepath) as f:
        for i, line in enumerate(f):
            if i == linenum: return line
    return ''
Here, linenum is 0-based (if you don't like that, and your Python is 2.6 or better, pass a starting value of 1 to enumerate), and the return value is the empty string for invalid line numbers.
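For example, a minimal sketch of such a 1-based variant (the name is mine, just for illustration):

def readoneline_1based(filepath, linenum):
    # 1-based line numbers, thanks to enumerate's start argument (Python 2.6+)
    if linenum < 1: return ''
    with open(filepath) as f:
        for i, line in enumerate(f, 1):
            if i == linenum: return line
    return ''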
Somewhat faster (and a lot more complicated) is to read, say, 100 MB at a time (in binary mode) into a buffer; count the number of line-ends in the buffer (just a .count('\n') call on the buffer string object); once the running total of line-ends exceeds the linenum you're looking for, find the Nth line-end currently in the buffer (where N is the difference between linenum, here 1-based, and the previous running total of line-ends), read a bit more if the (N+1)th line-end is not also in the buffer (as that's the point where your line ends), and extract the relevant substring. Not just a couple of lines, net of the with and the returns for anomalous cases...;-).
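To make that description concrete, here is a rough sketch of how such a buffered search might look (my illustration only, Python 2 like the rest of this answer, not tested on anything near 100 GB; the name getline_buffered is made up, and it assumes every line, including the last, ends with '\n'):

def getline_buffered(filepath, linenum, bufsize=100*1024*1024):
    # linenum is 1-based here, matching the description above
    if linenum < 1:
        return ''
    seen = 0       # line-ends counted in buffers already discarded
    pending = ''   # tail of the previous buffer after its last line-end
    with open(filepath, 'rb') as f:
        while True:
            chunk = f.read(bufsize)
            if not chunk:
                return ''  # linenum is past the end of the file
            buf = pending + chunk
            nl = buf.count('\n')
            if seen + nl >= linenum:
                # the requested line ends inside this buffer: it is the
                # (linenum - seen)-th line of buf, counting from 1
                return buf.split('\n')[linenum - seen - 1] + '\n'
            seen += nl
            # keep the partial last line so it isn't lost or double-counted
            pending = buf[buf.rfind('\n') + 1:]

The split of the single buffer that contains the target line is the "roughly constant amount of processing per buffer" discussed further down; you could swap it for repeated halving or an index loop without changing the I/O cost.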
Edit: since the OP commented doubting that reading by-buffer instead of by-line can make a performance difference, I unearthed an old piece of code where I was measuring the two approaches for a somewhat-related task -- counting the number of lines with the buffer approach, a loop on lines, or reading the whole file in memory at one gulp (by readlines). The target file is kjv.txt, the standard English text of the King James Version of the Bible, one line per verse, ASCII:
$ wc kjv.txt
114150 821108 4834378 kjv.txt
The platform is a MacBook Pro laptop, OSX 10.5.8, Intel Core 2 Duo at 2.4 GHz, Python 2.6.5.
The module for the test, readkjv.py:
def byline(fn='kjv.txt'):
    # count lines by iterating over the file line by line
    with open(fn) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1

def byall(fn='kjv.txt'):
    # count lines by slurping the whole file into a list of lines
    with open(fn) as f:
        return len(f.readlines())

def bybuf(fn='kjv.txt', BS=100*1024):
    # count lines by reading fixed-size binary buffers and counting '\n'
    with open(fn, 'rb') as f:
        tot = 0
        while True:
            blk = f.read(BS)
            if not blk: return tot
            tot += blk.count('\n')

if __name__ == '__main__':
    print bybuf()
    print byline()
    print byall()
The prints are just there to confirm correctness, of course (and they do ;-).
The measurement, of course after a few dry runs to ensure everybody's benefitting equally from the OS's, disk controller's, and filesystem's read-ahead functionality (if any):
$ py26 -mtimeit -s'import readkjv' 'readkjv.byall()'
10 loops, best of 3: 40.3 msec per loop
$ py26 -mtimeit -s'import readkjv' 'readkjv.byline()'
10 loops, best of 3: 39 msec per loop
$ py26 -mtimeit -s'import readkjv' 'readkjv.bybuf()'
10 loops, best of 3: 25.5 msec per loop
The numbers are quite repeatable. As you see, even on such a tiny file (less than 5 MB!), by-line approaches are slower than buffer-based ones -- just too much wasted effort!
To check scalability, I next used a 4-times-larger file, as follows:
$ cat kjv.txt kjv.txt kjv.txt kjv.txt >k4.txt
$ wc k4.txt
456600 3284432 19337512 k4.txt
$ py26 -mtimeit -s'import readkjv' 'readkjv.bybuf()'
10 loops, best of 3: 25.4 msec per loop
$ py26 -mtimeit -s'import readkjv' 'readkjv.bybuf("k4.txt")'
10 loops, best of 3: 102 msec per loop
and, as predicted, the by-buffer approach scales just about exactly linearly. Extrapolating (always a risky endeavour, of course;-), a bit less than 200 MB per second seems to be the predictable performance -- call it 6 seconds per GB, or maybe 10 minutes for 100 GB.
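Spelling out that extrapolation arithmetic, using nothing but the numbers measured above (and taking 1 GB as 1e9 bytes):

# back-of-the-envelope extrapolation from the k4.txt timing above
rate = 19337512 / 0.102    # bytes per second: roughly 190e6
print rate / 1e6           # -> roughly 190 (MB per second)
print 1e9 / rate           # -> roughly 5.3 (seconds per GB)
print 100e9 / rate / 60    # -> roughly 8.8 (minutes of pure reading+counting for 100 GB)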
Of course what this small program does is just line counting, but (once there is enough I/O to amortize constant overheads;-) a program to read a specific line should have similar performance (even though it needs more processing once it's found "the" buffer of interest, it's a roughly constant amount of processing for a buffer of a given size -- presumably repeated halving of the buffer to identify a small-enough part of it, then a little bit of effort linear in the size of the multiply-halved "buffer remainder").
Elegant? Not really... but, for speed, pretty hard to beat!-)