Fastest text search method in a large text file

南旧 2021-01-19 04:35

I am doing text searches in a rather big txt file (100k lines, about 7 MB). The text itself is not that big, but I need to run a lot of searches. I want to look for a target string and return the line it is on. What is the most efficient way?

4 answers
  • 2021-01-19 04:44

    How about 10GB/s search speeds? https://www.codeproject.com/Articles/5282980/Fastest-Fulltext-Vector-Scalar-Exact-Searcher

    What is the most efficient way?

    The most efficient way is to use vector (SIMD) instructions; if those are not available, use the fastest scalar memmem() function you can get. The article above shows both in action. If you need to traverse huge text files, the memmem() variant Railgun_NyoTengu(), which is open source and in the public domain, is the way to go.
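
    For illustration, a minimal Python sketch that leans on the C-level substring search exposed by mmap, which is about the closest stand-in for a scalar memmem() you get without dropping to C (find_first and needle are my names; the SIMD versions discussed in the article require native code):

    import mmap

    def find_first(path, needle):
        # needle must be bytes; returns the byte offset of the first match, or -1.
        with open(path, 'rb') as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm.find(needle)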

  • 2021-01-19 05:04

    First, don't explicitly decode bytes.

    # on Python 2, io.open behaves like Python 3's open() and decodes for you
    from io import open
    

    Second, consider things like this.

    def search(path, target):
        """Return the first line containing target, or None if it is absent."""
        with open(path, 'r', encoding='UTF-8') as src:
            found = None
            for line in src:  # iteration stops by itself at end of file
                if target in line:
                    found = line
                    break
            return found
    

    This can be simplified slightly: return None or return line directly instead of using break. It should run a hair faster, but multiple return statements make the function slightly harder to change later.
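
    A sketch of that simplified variant, keeping the same search() wrapper as above (the name is mine):

    def search(path, target):
        # Same search, returning straight from the loop instead of breaking.
        with open(path, 'r', encoding='UTF-8') as src:
            for line in src:
                if target in line:
                    return line
            return None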

  • 2021-01-19 05:06

    If you are searching the same text file over and over, consider indexing the file: for example, build a dictionary that maps each word to the lines it appears on (a sketch follows below). Building the index takes a while, but every lookup afterwards is O(1) on average.
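
    A minimal sketch of such an index, assuming whitespace tokenization and UTF-8 (build_index and the file name are placeholders):

    from collections import defaultdict

    def build_index(path):
        """Map every word to the set of 1-based line numbers it appears on."""
        index = defaultdict(set)
        with open(path, encoding='utf-8') as src:
            for lineno, line in enumerate(src, 1):
                for word in line.split():
                    index[word].add(lineno)
        return index

    # Build once, then each single-word lookup is a dictionary access:
    # index = build_index('big.txt')
    # lines_with_target = index.get('target', set())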

    If you are searching different text files, or can't index the file for some reason, you probably won't get any faster than the KMP algorithm.
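
    For reference, a textbook KMP implementation looks roughly like this (kmp_search is my name; in CPython the built-in in / str.find is usually faster in practice because it runs in C):

    def kmp_search(text, pattern):
        """Return the index of the first occurrence of pattern in text, or -1."""
        if not pattern:
            return 0
        # fail[i] = length of the longest proper prefix of pattern[:i+1]
        # that is also a suffix of it.
        fail = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            while k and pattern[i] != pattern[k]:
                k = fail[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            fail[i] = k
        # Scan the text without ever moving backwards in it.
        k = 0
        for i, ch in enumerate(text):
            while k and ch != pattern[k]:
                k = fail[k - 1]
            if ch == pattern[k]:
                k += 1
            if k == len(pattern):
                return i - k + 1
        return -1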

    EDIT: The index I described will only work for single-word searches, not multi-word searches. If you want to search for an arbitrary multi-word string, a simple word index won't help.

  • 2021-01-19 05:08
    1. Load the whole text into RAM at once; don't read line by line.
    2. Search for the pattern in that blob. If you find it at position pos, use text.count('\n', 0, pos) to get the line number.
    3. If you don't need the line number, look for the previous and next EOL around the match to cut the line out of the text (see the sketch below).

    An explicit loop in Python is slow; string searching itself is very fast. If you need to look for several strings at once, use regular expressions.
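
    A rough sketch of that whole approach, assuming UTF-8 and using a regex to look for several target strings at once (find_line_blob and targets are illustrative names):

    import re

    def find_line_blob(path, targets):
        """Return (line_number, line) for the first match of any target, or None."""
        with open(path, encoding='utf-8') as f:
            text = f.read()                       # whole file in RAM at once
        pattern = re.compile('|'.join(re.escape(t) for t in targets))
        m = pattern.search(text)
        if m is None:
            return None
        pos = m.start()
        lineno = text.count('\n', 0, pos) + 1     # 1-based line number
        start = text.rfind('\n', 0, pos) + 1      # previous EOL (or start of text)
        end = text.find('\n', pos)                # next EOL (or end of text)
        if end == -1:
            end = len(text)
        return lineno, text[start:end]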

    If that's not fast enough, use an external program like grep.
