Python efficient way to check if very large string contains a substring

無奈伤痛 2021-01-02 05:04

Python is not my best language, and so I'm not all that good at finding the most efficient solutions to some of my problems. I have a very large string (coming from a 30 MB

8 answers
  • 2021-01-02 05:43

    Are you implying that only a complete line will match? (The example in your EDIT, which matches on a newline only, seems to suggest so.)

    Then I imagine

    def contains_line(path, small_string):
        # Compare whole lines, ignoring the trailing newline on either side.
        with open(path) as f:
            for line in f:
                if line.rstrip('\n') == small_string.rstrip('\n'):
                    return True
        return False

    would be better.

    I.e., using == is perhaps quicker than 'in'. That said, I wouldn't be surprised if the underlying implementation of in catches the case where the line to search and the string to search for are the same length and just attempts an == itself.

  • 2021-01-02 05:47

    How slow is too slow? I just did an a in b test for a unique string at the end of a 170 MB string. It finished before my finger left the Enter key.

  • 2021-01-02 05:47
    small_string = "This is a line"
    big_string = "This is a line This is another line\nThis is yet another"

    # With maxsplit=1, split() returns two pieces only if the separator occurs.
    test = big_string.split(small_string, 1)

    if len(test) == 2:
        print("it's there")
    else:
        print("it's not")
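
    For comparison, the same check can be sketched with the in operator and str.find, neither of which builds an intermediate list of pieces. The variable names found and position are only illustrative.

```python
small_string = "This is a line"
big_string = "This is a line This is another line\nThis is yet another"

# `in` returns a bool directly.
found = small_string in big_string

# str.find returns the index of the first match, or -1 if there is none.
position = big_string.find(small_string)

print(found, position)
```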
    
  • 2021-01-02 05:48

    You can use one of these algorithms:

    • Rabin–Karp string search algorithm

    • Knuth–Morris–Pratt algorithm (aka KMP) see an implementation here

    Although I believe KMP is more efficient, it's more complicated to implement. The first link includes some pseudo-code that should make it very easy to implement in Python.

    You can look for alternatives here: http://en.wikipedia.org/wiki/String_searching_algorithm
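
    As a rough illustration of the KMP idea, here is a minimal sketch in Python. The function name kmp_find is hypothetical, and a pure-Python loop like this will most likely be much slower than the built-in in operator, which runs in C, so treat it as a way to understand the algorithm rather than as a speed-up.

```python
def kmp_find(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0

    # failure[i] is the length of the longest proper prefix of
    # pattern[:i + 1] that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k

    # Scan the text once; on a mismatch, fall back in the pattern
    # instead of moving backwards in the text.
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1
```

    Calling kmp_find(big_string, small_string) should agree with big_string.find(small_string) for any pair of strings.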

  • 2021-01-02 05:50

    I would rely on fast implementation by someone else:

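    import os
    import subprocess

    ...
    with open(os.devnull, 'w') as devnull:
        # Pass the arguments as a list so the shell never parses them;
        # -F treats smallstring as a fixed string rather than a regex.
        status = subprocess.call(['grep', '-F', '--', smallstring, file],
                                 stdout=devnull, stderr=subprocess.STDOUT)
        if status == 0:
            pass  # do stuff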
    

    Won't work on Windows.

    Edit: I'm worried that grep returns 0 whether it finds something or not, but I don't have a shell available right now, so I can't test it.

  • 2021-01-02 05:57

    Is it really slow? You're talking about a 30 MB string; let's try it with an even bigger string:

    In [12]: string="agu82934u"*50*1024*1024+"string to be found"
    
    In [13]: len(string)
    Out[13]: 471859218
    
    In [14]: %timeit "string to be found" in string
    1 loops, best of 3: 335 ms per loop
    
    In [15]: %timeit "string not to be found" in string
    1 loops, best of 3: 200 ms per loop
    

    I wouldn't say that 335 ms is a long time to spend looking for a substring in a 450 MB string.
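
    Outside IPython, a similar measurement can be sketched with the standard timeit module. The helper name time_substring_search is hypothetical, and the test string here is deliberately much smaller than the one above to keep memory use modest; timings will vary by machine and Python version.

```python
import timeit


def time_substring_search(haystack, needle, repeats=3):
    """Return the best time in seconds for one `needle in haystack` check."""
    timer = timeit.Timer(lambda: needle in haystack)
    # Run the check once per repeat and keep the fastest run.
    return min(timer.repeat(repeat=repeats, number=1))


if __name__ == "__main__":
    # Roughly 9 MB; raise the multiplier to approach the 450 MB case.
    big = "agu82934u" * 1024 * 1024 + "string to be found"
    print(len(big))
    print(time_substring_search(big, "string to be found"))
    print(time_substring_search(big, "string not to be found"))
```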
