I have a large text file, several hundred thousand lines long. I have to extract 30,000 specific lines that sit at random spots in the file. This is the program I have to extr
You could try reading in big blocks, and avoiding the overhead of line-splitting except for the specific lines of interest. E.g., assuming none of your lines is longer than a megabyte:
BLOCKSIZE = 1024 * 1024

def byblock_fulllines(f):
    tail = ''
    while True:
        block = f.read(BLOCKSIZE)
        if not block:
            break
        # rfind rather than rindex: a block with no newline at all
        # (e.g. a final fragment with no trailing '\n') must not raise,
        # it just accumulates into the tail
        linend = block.rfind('\n')
        if linend == -1:
            tail += block
            continue
        newtail = block[linend + 1:]
        block = tail + block[:linend + 1]
        tail = newtail
        yield block
    if tail:
        yield tail + '\n'
This takes an open file argument and yields blocks of about 1 MB, each guaranteed to end with a newline.
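For instance, a quick sanity check on a small in-memory file (a toy example; io.StringIO wants unicode under 2.x, hence the u'' literal):

import io

sample = io.StringIO(u'one\ntwo\nthree\nfour\n')
for block in byblock_fulllines(sample):
    print(repr(block))
# with BLOCKSIZE far larger than the data, the whole content
# comes out as one newline-terminated block

To identify (iterator-wise) all occurrences of a needle string within a haystack string: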
def needle_in_haystack(haystack, needle):
    # yield the start index of every occurrence of needle in haystack
    start = 0
    while True:
        where = haystack.find(needle, start)
        if where == -1:
            return
        yield where
        start = where + 1
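Note that matches may overlap, since the scan resumes at where + 1; for example:

>>> list(needle_in_haystack('ababab', 'abab'))
[0, 2]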
To identify all relevant lines from within such a block:
def wantlines_inblock(s, block):
    # yield each line of block containing s, at most once per line
    last_yielded = None
    for where in needle_in_haystack(block, s):
        # newline preceding this occurrence: rfind needs the (0, where)
        # bounds, and -1 (no preceding newline) is OK, as prevend + 1 == 0
        prevend = block.rfind('\n', 0, where)
        if prevend == last_yielded:
            continue  # several hits on one line: no double-yields
        linend = block.find('\n', where)
        if linend == -1:
            linend = len(block)
        yield block[prevend + 1: linend]
        last_yielded = prevend
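For example, on a small block (the line with two occurrences is yielded only once):

>>> block = 'S0414 first\nnothing here\nsecond S0414 and S0414 again\n'
>>> list(wantlines_inblock('S0414', block))
['S0414 first', 'second S0414 and S0414 again']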
How this all fits together:
def main():
    with open('bigfile.txt') as f:
        with open('smallfile.txt', 'w') as g:
            for block in byblock_fulllines(f):
                for line in wantlines_inblock('S0414', block):
                    g.write(line + '\n')
In Python 2.7 you could fold both with statements into one, just to reduce nesting a bit.
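For instance, a minimal sketch of that combined form (same placeholder names as above):

def main():
    with open('bigfile.txt') as f, open('smallfile.txt', 'w') as g:
        for block in byblock_fulllines(f):
            for line in wantlines_inblock('S0414', block):
                g.write(line + '\n')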
Note: this code is untested, so there might be (hopefully small ;-) errors such as off-by-ones. Performance needs tuning of the block size and must be calibrated by measurement on your specific machine and data. Your mileage may vary. Void where prohibited by law.
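If you want numbers for that calibration, one rough approach (a sketch only: it reassigns the module-level BLOCKSIZE, and 'bigfile.txt' / 'S0414' are the same placeholders as above):

import time

def time_blocksize(path, needle, blocksize):
    # crude timing of one full scanning pass with a given block size
    global BLOCKSIZE
    BLOCKSIZE = blocksize
    start = time.time()
    with open(path) as f:
        for block in byblock_fulllines(f):
            for line in wantlines_inblock(needle, block):
                pass  # just scan; a real run would write the lines out
    return time.time() - start

for size in (256 * 1024, 1024 * 1024, 4 * 1024 * 1024):
    print('%9d: %.2f s' % (size, time_blocksize('bigfile.txt', 'S0414', size)))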