I have a large text file, several hundred thousand lines long. I have to extract 30,000 specific lines that sit at random spots in the file. This is the program I have to extr
You could try reading in big blocks, and avoiding the overhead of line-splitting except for the specific lines of interest. E.g., assuming none of your lines is longer than a megabyte:
BLOCKSIZE = 1024 * 1024

def byblock_fulllines(f):
    tail = ''
    while True:
        block = f.read(BLOCKSIZE)
        if not block:
            break
        # rfind rather than rindex: a block with no newline at all
        # (e.g. a final fragment with no trailing '\n') must not raise,
        # it just accumulates into the tail
        linend = block.rfind('\n')
        if linend == -1:
            tail += block
            continue
        newtail = block[linend + 1:]
        block = tail + block[:linend + 1]
        tail = newtail
        yield block
    if tail:
        yield tail + '\n'
This takes an open file argument and yields blocks of about 1 MB, each guaranteed to end with a newline.
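For instance, a quick sanity check on a small in-memory file (a toy example; io.StringIO wants unicode under 2.x, hence the u'' literal):

import io

sample = io.StringIO(u'one\ntwo\nthree\nfour\n')
for block in byblock_fulllines(sample):
    print(repr(block))
# with BLOCKSIZE far larger than the data, the whole content
# comes out as one newline-terminated block

To identify (iterator-wise) all occurrences of a needle string within a haystack string: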
def needle_in_haystack(haystack, needle):
    # yield the start index of every occurrence of needle in haystack
    start = 0
    while True:
        where = haystack.find(needle, start)
        if where == -1:
            return
        yield where
        start = where + 1
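Note that matches may overlap, since the scan resumes at where + 1; for example:

>>> list(needle_in_haystack('ababab', 'abab'))
[0, 2]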
To identify all relevant lines from within such a block:
def wantlines_inblock(s, block):
    # yield each line of block containing s, at most once per line
    last_yielded = None
    for where in needle_in_haystack(block, s):
        # newline preceding this occurrence: rfind needs the (0, where)
        # bounds, and -1 (no preceding newline) is OK, as prevend + 1 == 0
        prevend = block.rfind('\n', 0, where)
        if prevend == last_yielded:
            continue  # several hits on one line: no double-yields
        linend = block.find('\n', where)
        if linend == -1:
            linend = len(block)
        yield block[prevend + 1: linend]
        last_yielded = prevend
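For example, on a small block (the line with two occurrences is yielded only once):

>>> block = 'S0414 first\nnothing here\nsecond S0414 and S0414 again\n'
>>> list(wantlines_inblock('S0414', block))
['S0414 first', 'second S0414 and S0414 again']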
How this all fits together:
def main():
    with open('bigfile.txt') as f:
        with open('smallfile.txt', 'w') as g:
            for block in byblock_fulllines(f):
                for line in wantlines_inblock('S0414', block):
                    g.write(line + '\n')
In Python 2.7 you could fold both with statements into one, just to reduce nesting a bit.
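For instance, a minimal sketch of that combined form (same placeholder names as above):

def main():
    with open('bigfile.txt') as f, open('smallfile.txt', 'w') as g:
        for block in byblock_fulllines(f):
            for line in wantlines_inblock('S0414', block):
                g.write(line + '\n')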
Note: this code is untested, so there might be (hopefully small ;-) errors such as off-by-ones. Performance needs tuning of the block size and must be calibrated by measurement on your specific machine and data. Your mileage may vary. Void where prohibited by law.
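If you want numbers for that calibration, one rough approach (a sketch only: it reassigns the module-level BLOCKSIZE, and 'bigfile.txt' / 'S0414' are the same placeholders as above):

import time

def time_blocksize(path, needle, blocksize):
    # crude timing of one full scanning pass with a given block size
    global BLOCKSIZE
    BLOCKSIZE = blocksize
    start = time.time()
    with open(path) as f:
        for block in byblock_fulllines(f):
            for line in wantlines_inblock(needle, block):
                pass  # just scan; a real run would write the lines out
    return time.time() - start

for size in (256 * 1024, 1024 * 1024, 4 * 1024 * 1024):
    print('%9d: %.2f s' % (size, time_blocksize('bigfile.txt', 'S0414', size)))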