Python is not my best language, and so I\'m not all that good at finding the most efficient solutions to some of my problems. I have a very large string (coming from a 30 MB
Are you implying that only a complete line will match? (your EDIT: matching on a newline only example seems to)
Then I imagine
for line in open('file').readlines():
if line==small_string:
return True
return False
IE, using == is quicker than 'in' - perhaps. I wouldn't be surprised if the underlying implementation of in catches the case where the line to search and the string to search for are the same length and just attempts an == itself.
woudl be better.
How slow is too slow? I just did an a in b
test for a unique string at the end of a 170 MB string. It finished before my finger left the Enter key.
small_string = "This is a line"
big_string = "This is a line This is another line\nThis is yet another"
test= big_string.split("This is a line" ,1)
if len(test)==2:
print "it`s there"
elif len(test)!=2:
print "it`s not"
You can use one of these algorithms:
Rabin–Karp string search algorithm
Knuth–Morris–Pratt algorithm (aka KMP) see an implementation here
Although I believe KMP is more efficient, it's more complicated to implement.The first link includes some pseudo-code that should make it very easy to implement in python.
you can look for alternatives here: http://en.wikipedia.org/wiki/String_searching_algorithm
I would rely on fast implementation by someone else:
import subprocess
from subprocess import STDOUT
import os
...
with open(os.devnull, 'w') as devnull:
if subprocess.call('grep %s "%s"' % (smallstring, file), shell=True, stdout=devnull, stderr=STDOUT) == 0:
pass #do stuff
Won't work on windows.
edit: I'm worried taht grep returns 0 wheter it finds something or not. But I don't have any shell available to me now so I can't test it.
Is it really slow? You're talking about 30MB string; let's try it with even bigger string:
In [12]: string="agu82934u"*50*1024*1024+"string to be found"
In [13]: len(string)
Out[13]: 471859218
In [14]: %timeit "string to be found" in string
1 loops, best of 3: 335 ms per loop
In [15]: %timeit "string not to be found" in string
1 loops, best of 3: 200 ms per loop
I wouldn't say that 335 ms is much time looking for substring in 450MB string.