Python: efficient way to check if a very large string contains a substring


Python is not my best language, so I'm not all that good at finding the most efficient solutions to some of my problems. I have a very large string (coming from a 30 MB

8 answers

执笔经年

2021-01-02 05:43
Are you implying that only a complete line will match? (The example in your edit, which matches only on a newline, seems to suggest so.)

Then I imagine
```
def contains_line(path, small_string):
    # Compare each line (minus its trailing newline) with the target string.
    with open(path) as f:
        for line in f:
            if line.rstrip('\n') == small_string:
                return True
    return False
```
would be better.

That is, using == is perhaps quicker than 'in'. I wouldn't be surprised, though, if the underlying implementation of 'in' catches the case where the line to search and the string to search for are the same length and just attempts an == itself.
南笙

2021-01-02 05:47

How slow is too slow? I just did an a in b test for a unique string at the end of a 170 MB string. It finished before my finger left the Enter key.
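
For anyone who wants to repeat this kind of check, here is a minimal sketch of such a test. The filler text, the needle, and the roughly 10 MB size are made up for illustration (smaller than the 170 MB above so that it runs quickly), and the timing will vary by machine.

```python
import time

# Hypothetical test data: about 10 MB of filler with a unique needle at the very end.
needle = "unique string at the end"
haystack = "abcdefghij" * (1024 * 1024) + needle

start = time.perf_counter()
found = needle in haystack  # plain substring test
elapsed = time.perf_counter() - start

print(f"found={found}, took {elapsed * 1000:.2f} ms")
```

If this already finishes in a few milliseconds on your data, a more elaborate approach is probably not worth the effort.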


走了就别回头了

2021-01-02 05:47

```
small_string = "This is a line"
big_string = "This is a line This is another line\nThis is yet another"

# With maxsplit=1, split() returns two parts only if the separator occurs.
test = big_string.split(small_string, 1)

if len(test) == 2:
    print("it's there")
else:
    print("it's not")
```
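
The split trick works, but it copies the pieces of the big string into new strings just to count them. As a minimal sketch of the more direct checks, using the same example strings: the `in` operator answers yes or no, and `str.find` also reports where the match starts. The variable names `is_present`, `position`, and `missing` are only for illustration.

```python
small_string = "This is a line"
big_string = "This is a line This is another line\nThis is yet another"

# Membership test: True if small_string occurs anywhere in big_string.
is_present = small_string in big_string

# find() returns the index of the first match, or -1 if there is none.
position = big_string.find(small_string)
missing = big_string.find("not in here")

print(is_present, position, missing)  # prints: True 0 -1
```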


予麋鹿

2021-01-02 05:48
You can use one of these algorithms:
- Rabin–Karp string search algorithm
- Knuth–Morris–Pratt algorithm (aka KMP) see an implementation here
Although I believe KMP is more efficient, it's more complicated to implement. The first link includes some pseudo-code that should make it very easy to implement in Python.

You can look for alternatives here: http://en.wikipedia.org/wiki/String_searching_algorithm
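
To give an idea of what KMP looks like in Python, here is a minimal sketch, not a tuned implementation. The function names `build_failure_table` and `kmp_find` are my own. Note that in CPython the built-in `in` and `str.find` are implemented in C, so a pure-Python loop like this one is likely to be considerably slower in practice; treat it as an illustration of the algorithm.

```python
def build_failure_table(pattern):
    """For each prefix of pattern, the length of its longest proper prefix
    that is also a suffix of that prefix."""
    table = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]  # fall back to a shorter candidate prefix
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_find(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    table = build_failure_table(pattern)
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]  # reuse the part of the match that is still valid
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1  # start index of the match
    return -1


print(kmp_find("abcabcabd", "abcabd"))  # prints: 3
```

The text is scanned once and never re-read, which is where the linear running time comes from.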
星月不相逢

2021-01-02 05:50
I would rely on a fast implementation by someone else:
```
import subprocess

...
# grep exits with 0 when it finds a match and 1 when it does not.
# -q suppresses output; -F treats smallstring as a literal string, not a regex.
# Passing an argument list (no shell=True) avoids shell quoting problems.
if subprocess.call(['grep', '-q', '-F', '--', smallstring, file]) == 0:
    pass  # do stuff
```
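Won't work on Windows.

Edit: I was worried that grep might return 0 whether it finds something or not, and I had no shell available to test it. Its documented exit status is 0 when a match is found and 1 when none is, so comparing the return code with 0 should hold.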

我在风中等你

2021-01-02 05:57

Is it really slow? You're talking about a 30 MB string; let's try it with an even bigger one:

```
In [12]: string="agu82934u"*50*1024*1024+"string to be found"

In [13]: len(string)
Out[13]: 471859218

In [14]: %timeit "string to be found" in string
1 loops, best of 3: 335 ms per loop

In [15]: %timeit "string not to be found" in string
1 loops, best of 3: 200 ms per loop
```

I wouldn't say that 335 ms is a long time to spend looking for a substring in a 450 MB string.
