Efficient way to read a specific line number of a file. (BONUS: Python Manual Misprint)

前端 未结 5 689
醉梦人生
醉梦人生 2021-02-09 17:02

I have a 100 GB text file, which is a BCP dump from a database. When I try to import it with BULK INSERT, I get a cryptic error on line number 219506324. Before sol

相关标签:
5条回答
  • 2021-02-09 17:27

    Here's my elegant version in C#:

    Console.Write(File.ReadLines(@"s:\source\transactions.dat").ElementAt(219506323));
    

    or more general:

    Console.Write(File.ReadLines(filename).ElementAt(linenumber - 1));
    

    Of course, you may want to show some context before and after the given line:

    Console.Write(string.Join("\n",
                  File.ReadLines(filename).Skip(linenumber - 5).Take(10)));
    

    or more fluently:

    File
    .ReadLines(filename)
    .Skip(linenumber - 5)
    .Take(10)
    .AsObservable()
    .Do(Console.WriteLine);
    

    BTW, the linecache module does not do anything clever with large files. It just reads the whole thing in, keeping it all in memory. The only exceptions it catches are I/O-related (can't access file, file not found, etc.). Here's the important part of the code:

        fp = open(fullname, 'rU')
        lines = fp.readlines()
        fp.close()
    

    In other words, it's trying to fit the whole 100GB file into 6GB of RAM! What the manual should say is maybe "This function will never throw an exception if it can't access the file."

    0 讨论(0)
  • 2021-02-09 17:31

    Well, memory can run out at any time, asynchronously and unpredictably -- that's why the "never an exception" promise doesn't really apply there (just like, say, in Java, where every method must specify which exceptions it can raise, some exceptions are exempted from this rule, since just about any method, unpredictably, can raise them at any time due to resource scarcity or other system-wide issues).

    linecache tries to read the whole file. Your only simple alternative (hopefully you're not in a hurry) is to read one line at a time from the start...:

    def readoneline(filepath, linenum):
        if linenum < 0: return ''
        with open(filepath) as f:
            for i, line in enumerate(filepath):
                if i == linenum: return line
            return ''
    

    Here, linenum is 0-based (if you don't like that, and your Python is 2.6 or better, pass a starting value of 1 to enumerate), and the return value is the empty string for invalid line numbers.

    Somewhat faster (and a lot more complicated) is to read, say, 100 MB at a time (in binary mode) into a buffer; count the number of line-ends in the buffer (just a .count('\n') call on the buffer string object); once the running total of line ends exceeds the linenum you're looking for, find the Nth line-end currently in the buffer (where N is the difference between linenum, here 1-based, and the previous running total of line ends), read a bit more if the N+1st line-end is not also in the buffer (as that's the point where your line ends), extract the relevant substring. Not just a couple of lines net of the with and returns for anomalous cases...;-).

    Edit: since the OP comments doubting that reading by-buffer instead of by-line can make a performance difference, I unrooted an old piece of code where I was measuring the two approaches for a somewhat-related tasks -- counting the number of lines with the buffer approach, a loop on lines, or reading the whole file in memory at one gulp (by readlines). The target file is kjv.txt, the standard English text of the King James' Version of the Bible, one line per verse, ASCII:

    $ wc kjv.txt 
      114150  821108 4834378 kjv.txt
    

    Platform is a MacOS Pro laptop, OSX 10.5.8, Intel Core 2 Duo at 2.4 GHz, Python 2.6.5.

    The module for the test, readkjv.py:

    def byline(fn='kjv.txt'):
        with open(fn) as f:
            for i, _ in enumerate(f):
                pass
        return i +1
    
    def byall(fn='kjv.txt'):
        with open(fn) as f:
            return len(f.readlines())
    
    def bybuf(fn='kjv.txt', BS=100*1024):
        with open(fn, 'rb') as f:
            tot = 0
            while True:
                blk = f.read(BS)
                if not blk: return tot
                tot += blk.count('\n')
    
    if __name__ == '__main__':
        print bybuf()
        print byline()
        print byall()
    

    The prints are just to confirm correctness of course (and do;-).

    The measurement, of course after a few dry runs to ensure everybody's benefitting equally from the OS's, disk controller's, and filesystem's read-ahead functionality (if any):

    $ py26 -mtimeit -s'import readkjv' 'readkjv.byall()'
    10 loops, best of 3: 40.3 msec per loop
    $ py26 -mtimeit -s'import readkjv' 'readkjv.byline()'
    10 loops, best of 3: 39 msec per loop
    $ py26 -mtimeit -s'import readkjv' 'readkjv.bybuf()'
    10 loops, best of 3: 25.5 msec per loop
    

    The numbers are quite repeatable. As you see, even on such a tiny file (less than 5 MB!), by-line approaches are slower than buffer-based ones -- just too much wasted effort!

    To check scalability, I next used a 4-times-larger file, as follows:

    $ cat kjv.txt kjv.txt kjv.txt kjv.txt >k4.txt
    $ wc k4.txt
      456600 3284432 19337512 k4.txt
    $ py26 -mtimeit -s'import readkjv' 'readkjv.bybuf()'
    10 loops, best of 3: 25.4 msec per loop
    $ py26 -mtimeit -s'import readkjv' 'readkjv.bybuf("k4.txt")'
    10 loops, best of 3: 102 msec per loop
    

    and, as predicted, the by-buffer approach scales just about exactly linearly. Extrapolating (always a risky endeavour, of course;-), a bit less than 200 MB per second seems to be the predictable performance -- call it 6 seconds per GB, or maybe 10 minutes for 100 GB.

    Of course what this small program does is just line counting, but (once there is enough I/O t amortize constant overheads;-) a program to read a specific line should have similar performance (even though it needs more processing once it's found "the" buffer of interest, it's a roughly constant amount of processing, for a buffer of a given size -- presumably repeated halving of the buffer to identify a small-enough part of it, then a little bit of effort linear in the size of the multiply-halved "buffer remainder").

    Elegant? Not really... but, for speed, pretty hard to beat!-)

    0 讨论(0)
  • 2021-02-09 17:32

    If you expect to have this operation needed often on the same file it would make sense to make an index.

    You make an index by going through the whole file once and recording the positions of line beginnings, for example in a sqlite database. Then when you need to go to a specific line you query the index for it, seek to that position and read the line.

    0 讨论(0)
  • 2021-02-09 17:48

    Not a elegant but a faster solution would be to use multiple threads (or tasks in .NET 4.0) to read & process multiple chunks of a file at the same time.

    0 讨论(0)
  • 2021-02-09 17:50

    You can try this sed one-liner: sed '42q;d' to fetch line number 42. It's not in Python or C#, but I assume you have sed on your Mac.

    0 讨论(0)
提交回复
热议问题