Extract specific text lines?

庸人自扰 2021-02-03 15:09

I have a large text file of several hundred thousand lines. I have to extract 30,000 specific lines that are scattered throughout the file. This is the program I have to extr

10 answers
  •  不思量自难忘°
    2021-02-03 15:20

    You could try reading in big blocks, and avoiding the overhead of line-splitting except for the specific lines of interest. E.g., assuming none of your lines is longer than a megabyte:

    BLOCKSIZE = 1024 * 1024
    
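    def byblock_fulllines(f):
        tail = ''  # partial last line carried over from the previous read
        while True:
            block = f.read(BLOCKSIZE)
            if not block: break
            linend = block.rfind('\n')
            if linend == -1:
                # no newline in this block: keep accumulating
                tail += block
                continue
            newtail = block[linend + 1:]
            block = tail + block[:linend + 1]
            tail = newtail
            yield block
        # the file may not end with a newline; add one for consistency
        if tail: yield tail + '\n'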
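
One way to check the "ends with a newline" guarantee is to run the generator over an in-memory file with a tiny block size. The sketch below is self-contained and makes two assumptions that are not in the answer: the `blocksize` parameter (added so the test data can stay small) and the handling of a block that contains no newline at all, which is carried over to the next read.

```python
import io

def byblock_fulllines(f, blocksize=1024 * 1024):
    """Yield chunks of roughly `blocksize` characters, each ending in a newline."""
    tail = ''
    while True:
        block = f.read(blocksize)
        if not block:
            break
        linend = block.rfind('\n')
        if linend == -1:
            # No newline in this block: carry all of it over to the next read.
            tail += block
            continue
        yield tail + block[:linend + 1]
        tail = block[linend + 1:]
    if tail:
        # The final line lacked a trailing newline; add one for consistency.
        yield tail + '\n'

# Hypothetical sample data; the last line has no trailing newline.
sample = io.StringIO('alpha\nbeta S0414\ngamma\ndelta')
chunks = list(byblock_fulllines(sample, blocksize=8))
print(chunks)
```

Every chunk ends at a line boundary, so no line is ever split between two chunks, and joining the chunks gives back the input plus one final newline.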
    

    This takes an open file argument and yields blocks of about 1 MB, each guaranteed to end with a newline. To identify (iterator-wise) all occurrences of a needle string within a haystack string:

    def haystack_in_needle(haystack, needle):
        start = 0
        while True:
            where = haystack.find(needle, start)
            if where == -1: return
            yield where
            start = where + 1
    

    To identify all relevant lines from within such a block:

    def wantlines_inblock(s, block):
        last_yielded = None
        for where in haystack_in_needle(block, s):
            prevend = block.rfind('\n', 0, where)  # search backwards from the match; could be -1, that's OK
            if prevend == last_yielded: continue  # no double-yields
            linend = block.find('\n', where)
            if linend == -1: linend = len(block)
            yield block[prevend + 1: linend]
            last_yielded = prevend
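
To see the per-block extraction at work, here is a small self-contained sketch. The helper name `find_all` is hypothetical (it plays the role of the answer's occurrence finder), and the backward search is written as `rfind('\n', 0, where)` so that it looks only at the part of the block before the match.

```python
def find_all(haystack, needle):
    """Yield every index at which needle occurs in haystack."""
    start = 0
    while True:
        where = haystack.find(needle, start)
        if where == -1:
            return
        yield where
        start = where + 1

def wantlines_inblock(s, block):
    """Yield each line of block that contains s, once, without its newline."""
    last_yielded = None
    for where in find_all(block, s):
        # Newline that ends the previous line; -1 if the match is on the first line.
        prevend = block.rfind('\n', 0, where)
        if prevend == last_yielded:
            continue  # second match on a line that was already yielded
        linend = block.find('\n', where)
        if linend == -1:
            linend = len(block)
        yield block[prevend + 1:linend]
        last_yielded = prevend

# Hypothetical sample block: the third line contains the needle twice.
block = 'S0414 first\nskip me\nS0414 twice S0414\nlast S0414\n'
wanted = list(wantlines_inblock('S0414', block))
print(wanted)
```

The yielded lines do not include their trailing newline, so a caller writing them to a file has to add one.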
    

    How this all fits together:

    def main():
        with open('bigfile.txt') as f:
            with open('smallfile.txt', 'w') as g:
                for block in byblock_fulllines(f):
                    for line in wantlines_inblock('S0414', block):
                        # yielded lines carry no newline, so add one back
                        g.write(line + '\n')
    

    In Python 2.7 (or 3.1 and later) you could fold both with statements into one, just to reduce nesting a bit.
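
A minimal sketch of the folded form follows. To keep it runnable on its own, it writes a tiny stand-in input to a temporary directory and filters it with plain line iteration instead of the block functions; the file contents and the temporary paths are assumptions made for illustration.

```python
import os
import tempfile

# Create a small stand-in for bigfile.txt in a temporary directory.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, 'bigfile.txt')
dst = os.path.join(tmpdir, 'smallfile.txt')
with open(src, 'w') as f:
    f.write('keep S0414\ndrop\nkeep S0414 too\n')

# One with statement manages both files (Python 2.7+ / 3.1+).
with open(src) as f, open(dst, 'w') as g:
    for line in f:
        if 'S0414' in line:
            g.write(line)  # lines read this way still end with '\n'

with open(dst) as g:
    result = g.read()
print(result)
```

Both files are closed when the block exits, exactly as with the nested version. The line-by-line loop shown here also makes a handy baseline when measuring whether block reading is actually faster on your data.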

    Note: this code is untested so there might be (hopefully small;-) errors such as off-by-one mistakes. Performance needs tuning of the block size and must be calibrated by measurement on your specific machine and data. Your mileage may vary. Void where prohibited by law.
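
For the calibration step, a rough timing harness along these lines may help. It is only a sketch under stated assumptions: the data is synthetic and held in memory, the `blocksize` parameter and the helper names are hypothetical, and the per-block filter is a simplified stand-in (`splitlines` plus a whole-block membership test) rather than the find-based extraction, so it measures the effect of block size only.

```python
import io
import time

def byblock_fulllines(f, blocksize):
    """Yield chunks of roughly `blocksize` characters ending in a newline."""
    tail = ''
    while True:
        block = f.read(blocksize)
        if not block:
            break
        linend = block.rfind('\n')
        if linend == -1:
            tail += block
            continue
        yield tail + block[:linend + 1]
        tail = block[linend + 1:]
    if tail:
        yield tail + '\n'

def lines_with(needle, block):
    """Return the lines of block containing needle (simple, not tuned)."""
    if needle not in block:  # cheap rejection of a whole block
        return []
    return [ln for ln in block.splitlines() if needle in ln]

def extract_by_block(text, needle, blocksize):
    out = []
    for block in byblock_fulllines(io.StringIO(text), blocksize):
        out.extend(lines_with(needle, block))
    return out

def extract_by_line(text, needle):
    return [ln.rstrip('\n') for ln in io.StringIO(text) if needle in ln]

# Synthetic data: 200,000 lines, one in ten containing the marker.
text = ''.join(
    ('row %d S0414\n' if i % 10 == 0 else 'row %d\n') % i
    for i in range(200000)
)

baseline = extract_by_line(text, 'S0414')
timings = {}
for blocksize in (64 * 1024, 1024 * 1024, 4 * 1024 * 1024):
    t0 = time.perf_counter()
    found = extract_by_block(text, 'S0414', blocksize)
    timings[blocksize] = time.perf_counter() - t0
    assert found == baseline  # same lines, in the same order
for blocksize, seconds in sorted(timings.items()):
    print('%8d chars: %.3f s' % (blocksize, seconds))
```

Timings from an in-memory string say little about disk behaviour, so the same loop should be repeated against the real file before settling on a block size.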
