Cheap way to search a large text file for a string

隐瞒了意图╮ 2020-11-27 04:15

I need to search a pretty large text file for a particular string. It's a build log with about 5000 lines of text. What's the best way to go about doing that? Using regex sho

9 answers
  • 2020-11-27 05:07

    If there is no way to tell where the string will be (first half, second half, etc.), then there is really no optimized way to do the search other than the built-in "find" function. You could reduce the I/O time and memory consumption by not reading the file all in one shot, but in 4 KB blocks (a typical disk block size). This will not make the search faster unless the string is in the first part of the file, but in all cases it will reduce memory consumption, which might be a good idea if the file is huge.
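
    If the string cannot span a line break (usually true for a build log), a simpler variation on the same idea is to iterate over the file line by line instead of in fixed-size blocks. This also keeps only a small part of the file in memory at a time. A minimal sketch; `find_line` and its parameters are hypothetical names chosen for illustration:

```python
def find_line(path, needle, encoding="utf-8"):
    """Return the 1-based number of the first line containing needle, or -1."""
    # errors="replace" is an assumption: it keeps a stray undecodable byte
    # in the log from raising UnicodeDecodeError mid-search.
    with open(path, "r", encoding=encoding, errors="replace") as f:
        # Iterating over the file object reads one line at a time.
        for lineno, line in enumerate(f, start=1):
            if needle in line:
                return lineno
    return -1
```

    For a log named, say, `build.log`, `find_line('build.log', 'error')` would give the line number of the first match, and the loop stops reading as soon as it is found.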

  • 2020-11-27 05:07

    This is entirely inspired by laurasia's answer above, but it refines the structure.

    It also adds some checks:

    • It will correctly return 0 when searching an empty file for the empty string. In laurasia's answer, this is an edge case that will return -1.
    • It also pre-checks whether the goal string is larger than the buffer size, and raises an error if this is the case.

    In practice, the goal string should be much smaller than the buffer for efficiency, and there are more efficient methods of searching if the size of the goal string is very close to the size of the buffer.

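    def fnd(fname, goal, start=0, bsize=4096):
        """Return the byte offset of the first occurrence of goal (a bytes
        object) in the file at or after start, or -1 if it is not found."""
        if bsize < len(goal):
            raise ValueError("The buffer size must be at least as large as the string being searched for.")
        with open(fname, 'rb') as f:
            if start > 0:
                f.seek(start)
            # Re-read the last len(goal) - 1 bytes of each block so that a
            # match straddling two blocks is not missed.
            overlap = len(goal) - 1
            while True:
                buffer = f.read(bsize)
                pos = buffer.find(goal)
                if pos >= 0:
                    return f.tell() - len(buffer) + pos
                # A short read means end of file. Stopping here (rather than
                # only on an empty read) avoids seeking back and re-reading
                # the same final bytes forever.
                if len(buffer) < bsize:
                    return -1
                f.seek(f.tell() - overlap)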
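
    To see why the overlap of len(goal) - 1 bytes matters, here is a small self-contained sketch that compares a block search without overlap against one that carries the tail of each block forward. The helper names and the sample data are made up for illustration, and the second helper keeps the tail in memory instead of seeking back, which is a slightly different mechanism from the function above:

```python
import io

def chunks_naive(stream, goal, bsize):
    """Search each block on its own, with no overlap between blocks."""
    while True:
        block = stream.read(bsize)
        if not block:
            return False
        if goal in block:
            return True

def chunks_overlap(stream, goal, bsize):
    """Prepend the last len(goal) - 1 bytes of the previous window to each block."""
    tail = b""
    keep = len(goal) - 1
    while True:
        block = stream.read(bsize)
        if not block:
            return False
        window = tail + block
        if goal in window:
            return True
        tail = window[-keep:] if keep > 0 else b""

data = b"aaaaaaERROR bbbb"  # b"ERROR" starts at offset 6
# With 8-byte blocks the data splits into b"aaaaaaER" and b"ROR bbbb",
# so the match is cut in two at the block boundary.
found_naive = chunks_naive(io.BytesIO(data), b"ERROR", 8)
found_overlap = chunks_overlap(io.BytesIO(data), b"ERROR", 8)
print(found_naive, found_overlap)
```

    The search without overlap misses the match because neither block contains the whole string, while the version with overlap sees the tail b"aaER" joined to the next block and finds it.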
    
  • 2020-11-27 05:16

    You could do a simple find:

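    # Read the whole file into one string; the with block closes the file.
    with open('file.txt', 'r') as f:
        text = f.read()
    # Index of the first occurrence of 'string', or -1 if it is absent.
    answer = text.find('string')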
    

    A simple find will be quite a bit quicker than regex if you can get away with it.
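
    If you want to check this on your own machine, a rough timing sketch along these lines may help. The log text here is synthetic (an assumption standing in for a real build log), and the numbers will vary with the Python version, the pattern, and where the match sits in the text:

```python
import re
import timeit

# Synthetic stand-in for a ~5000-line build log; the target is on the last line.
text = "\n".join(f"[step {i}] compiling module_{i}.c ... ok" for i in range(5000))
text += "\nBUILD FAILED: linker error"
needle = "BUILD FAILED"

# Escape the needle so the regex matches it literally, then precompile.
pattern = re.compile(re.escape(needle))

pos_find = text.find(needle)
match = pattern.search(text)
pos_regex = match.start() if match else -1

# Time 200 repetitions of each approach on the same text.
t_find = timeit.timeit(lambda: text.find(needle), number=200)
t_regex = timeit.timeit(lambda: pattern.search(text), number=200)
print(f"str.find: {t_find:.4f}s  re.search: {t_regex:.4f}s  (200 runs each)")
```

    Both approaches report the same position for a literal string, so the choice comes down to speed and whether you actually need pattern matching.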
