The most efficient way to search for an array of strings in another string

礼貌的吻别

I have a large array of strings that looks something like this: String temp[] = new String[200000].

I have another String, let's call it bigtext. What I ne

8 Answers

挽巷

2021-02-12 15:53

I think you're looking for an algorithm like Rabin-Karp or Aho–Corasick, which are designed to search a text for a large number of substrings at once.
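
To make the Aho–Corasick idea concrete, here is a minimal sketch, not a tuned implementation. The class name `AhoCorasick` and the method `findAll` are hypothetical names chosen for this illustration, and it assumes the goal is to learn which entries of temp occur anywhere in bigtext. Empty patterns are ignored.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: an Aho-Corasick automaton built from an array of patterns. */
final class AhoCorasick {
    private final List<Map<Character, Integer>> next = new ArrayList<>(); // trie edges per node
    private final List<Integer> fail = new ArrayList<>();     // failure link per node
    private final List<Integer> dictLink = new ArrayList<>(); // nearest suffix node ending a pattern, or -1
    private final List<List<Integer>> ends = new ArrayList<>(); // pattern indices ending at node

    AhoCorasick(String[] patterns) {
        newNode(); // node 0 is the root
        for (int i = 0; i < patterns.length; i++) {
            if (patterns[i].isEmpty()) continue; // assumption: empty patterns are skipped
            int node = 0;
            for (char c : patterns[i].toCharArray()) {
                Integer child = next.get(node).get(c);
                if (child == null) {
                    child = newNode();
                    next.get(node).put(c, child);
                }
                node = child;
            }
            ends.get(node).add(i);
        }
        buildLinks();
    }

    private int newNode() {
        next.add(new HashMap<>());
        fail.add(0);
        dictLink.add(-1);
        ends.add(new ArrayList<>());
        return next.size() - 1;
    }

    /** Breadth-first pass that fills in failure and dictionary links. */
    private void buildLinks() {
        Queue<Integer> queue = new ArrayDeque<>();
        for (int child : next.get(0).values()) queue.add(child); // depth-1 nodes fail to the root
        while (!queue.isEmpty()) {
            int node = queue.remove();
            for (Map.Entry<Character, Integer> e : next.get(node).entrySet()) {
                char c = e.getKey();
                int child = e.getValue();
                int f = fail.get(node);
                while (f != 0 && !next.get(f).containsKey(c)) f = fail.get(f);
                Integer target = next.get(f).get(c);
                int link = (target == null) ? 0 : target;
                fail.set(child, link);
                dictLink.set(child, ends.get(link).isEmpty() ? dictLink.get(link) : link);
                queue.add(child);
            }
        }
    }

    /** Returns the sorted indices of all patterns that occur somewhere in text. */
    Set<Integer> findAll(String text) {
        Set<Integer> found = new TreeSet<>();
        int node = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (node != 0 && !next.get(node).containsKey(c)) node = fail.get(node);
            Integer target = next.get(node).get(c);
            node = (target == null) ? 0 : target;
            // Report every pattern that ends at this position in the text.
            for (int n = node; n != -1; n = dictLink.get(n)) found.addAll(ends.get(n));
        }
        return found;
    }
}
```

Under these assumptions, `new AhoCorasick(temp).findAll(bigtext)` would scan bigtext once and return the indices of the matching entries of temp.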

醉酒成梦

2021-02-12 16:00
I'm afraid it's not efficient at all in any case!

To pick the right algorithm, you need to provide some answers:
1. What can be computed off-line? That is, is bigtext known in advance? I guess temp is not, from its name.
2. Are you actually searching for words? If so, index them. A Bloom filter can help, too.
3. If you need a bit of fuzziness, maybe stemming or Soundex can do the job?
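
As an illustration of point 2, here is a minimal sketch of a word index. It assumes the entries of temp are whole words and that splitting bigtext on non-word characters is an acceptable definition of "word". `WordIndex` and `wordsPresent` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: index the words of bigtext once, then look up each entry of temp. */
final class WordIndex {
    static List<String> wordsPresent(String bigtext, String[] temp) {
        // Assumption: a "word" is a maximal run of letters, digits, or underscores.
        Set<String> index = new HashSet<>();
        for (String word : bigtext.split("\\W+")) {
            index.add(word);
        }
        List<String> found = new ArrayList<>();
        for (String candidate : temp) {
            if (index.contains(candidate)) { // expected O(1) per lookup
                found.add(candidate);
            }
        }
        return found;
    }
}
```

This only finds whole-word matches: an entry such as "at" would not be reported for a text that contains only "cat".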
Sticking to a strict inclusion test, you might build a trie from your temp array. It would prevent searching for the same substring several times.
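
A minimal sketch of that trie idea, assuming the goal is to collect the entries of temp that occur in bigtext. `TrieSearch` and `containedIn` are hypothetical names, and empty entries are never reported.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: a trie over the array, walked from every start position of the text. */
final class TrieSearch {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String word; // non-null if an entry of the array ends at this node
    }

    private final Node root = new Node();

    TrieSearch(String[] words) {
        for (String w : words) {
            Node node = root;
            for (char c : w.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.word = w; // shared prefixes are stored only once
        }
    }

    /** Returns the sorted set of array entries that occur somewhere in text. */
    Set<String> containedIn(String text) {
        Set<String> found = new TreeSet<>();
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            // Walk the trie for as long as the text keeps matching some entry's prefix.
            for (int i = start; i < text.length(); i++) {
                node = node.children.get(text.charAt(i));
                if (node == null) break;
                if (node.word != null) found.add(node.word);
            }
        }
        return found;
    }
}
```

Each start position costs at most the length of the longest entry, so the scan no longer depends on how many strings temp holds.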

我在风中等你

2021-02-12 16:01
That is a very efficient approach. You can improve it slightly by evaluating temp.length only once:
```
// Read temp.length once instead of on every iteration
for (int x = 0, len = temp.length; x < len; x++)
```
You don't provide enough detail about your program, but it's quite possible you could find a more efficient approach by redesigning it.

渐次进展

2021-02-12 16:06

Note that your current complexity is O(|S1|*n), where |S1| is the length of bigtext and n is the number of elements in your array, since each search is actually O(|S1|).

By building a suffix tree from bigtext, and iterating on elements in the array, you could bring this complexity down to O(|S1| + |S2|*n), where |S2| is the length of the longest string in the array. Assuming |S2| << |S1|, it could be much faster!

Building a suffix tree is O(|S1|), and each search is O(|S2|): a lookup does not scan bigtext, it only walks the relevant path of the suffix tree. Since this is done n times, the total is O(|S1| + n*|S2|), which is asymptotically better than the naive implementation.
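
A full suffix tree is lengthy to write out, so the sketch below swaps in a suffix automaton, a different structure that offers the same bounds for this task: it is built from bigtext in time linear in |S1| (with hash-map transitions, in expectation) and answers each substring query in O(|S2|). `SuffixAutomaton` and `contains` are hypothetical names. Treat it as an illustration of the "index bigtext once, then query n times" idea, not as the suffix tree itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: suffix automaton of a text, used here in place of a suffix tree. */
final class SuffixAutomaton {
    private final List<Map<Character, Integer>> next = new ArrayList<>(); // transitions per state
    private final List<Integer> link = new ArrayList<>(); // suffix link per state
    private final List<Integer> len = new ArrayList<>();  // longest substring length per state
    private int last;

    SuffixAutomaton(String text) {
        last = newState(0, -1); // state 0 is the initial state
        for (int i = 0; i < text.length(); i++) extend(text.charAt(i));
    }

    private int newState(int length, int suffixLink) {
        next.add(new HashMap<>());
        len.add(length);
        link.add(suffixLink);
        return next.size() - 1;
    }

    /** Standard online construction step: append one character to the text. */
    private void extend(char c) {
        int cur = newState(len.get(last) + 1, -1);
        int p = last;
        while (p != -1 && !next.get(p).containsKey(c)) {
            next.get(p).put(c, cur);
            p = link.get(p);
        }
        if (p == -1) {
            link.set(cur, 0);
        } else {
            int q = next.get(p).get(c);
            if (len.get(p) + 1 == len.get(q)) {
                link.set(cur, q);
            } else {
                // Split q: the clone keeps q's transitions but a shorter length.
                int clone = newState(len.get(p) + 1, link.get(q));
                next.get(clone).putAll(next.get(q));
                while (p != -1 && Integer.valueOf(q).equals(next.get(p).get(c))) {
                    next.get(p).put(c, clone);
                    p = link.get(p);
                }
                link.set(q, clone);
                link.set(cur, clone);
            }
        }
        last = cur;
    }

    /** True if pattern is a substring of the indexed text; takes O(pattern.length()) steps. */
    boolean contains(String pattern) {
        int state = 0;
        for (int i = 0; i < pattern.length(); i++) {
            Integer target = next.get(state).get(pattern.charAt(i));
            if (target == null) return false;
            state = target;
        }
        return true;
    }
}
```

With this in place, one would build `new SuffixAutomaton(bigtext)` once and call `contains(temp[x])` for each element. The automaton has at most about 2*|S1| states, so memory use grows with bigtext.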

滥情空心

2021-02-12 16:08

If you have additional information about temp, you may be able to improve the iteration.

You can also reduce the time spent if you parallelize the iteration.
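
One way to parallelize the iteration, sketched with Java's parallel streams. `ParallelScan` and `containedIn` are hypothetical names, and the sketch assumes the per-element check is a plain `String.contains` call on bigtext.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper: check each array element against bigtext on multiple threads. */
final class ParallelScan {
    static List<String> containedIn(String bigtext, String[] temp) {
        return Arrays.stream(temp)
                .parallel()                    // split the array across the common fork-join pool
                .filter(bigtext::contains)     // keep the elements that occur in bigtext
                .collect(Collectors.toList()); // encounter order of the array is preserved
    }
}
```

This leaves the total amount of work unchanged and only spreads it over the available cores, so the gain is bounded by the core count.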

借酒劲吻你

2021-02-12 16:14

Use a search algorithm like Boyer-Moore. Searching for "Boyer-Moore" turns up many links that explain how it works. For instance, there is a Java example.
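
For a feel of how this family of algorithms works, here is a minimal sketch of Boyer-Moore-Horspool, a common simplification of Boyer-Moore that keeps only the bad-character shift. `Horspool` and `indexOf` are hypothetical names. It searches for a single pattern, so it would still be called once per element of temp.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: Boyer-Moore-Horspool search for one pattern in a text. */
final class Horspool {
    /** Returns the index of the first occurrence of pattern in text, or -1 if there is none. */
    static int indexOf(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        if (m == 0) return 0;

        // For each character except the last, record its distance to the pattern's end.
        Map<Character, Integer> shift = new HashMap<>();
        for (int i = 0; i < m - 1; i++) {
            shift.put(pattern.charAt(i), m - 1 - i);
        }

        int pos = 0;
        while (pos <= n - m) {
            // Compare right to left within the current window.
            int j = m - 1;
            while (j >= 0 && text.charAt(pos + j) == pattern.charAt(j)) j--;
            if (j < 0) return pos;
            // Skip ahead based on the text character aligned with the pattern's last position.
            pos += shift.getOrDefault(text.charAt(pos + m - 1), m);
        }
        return -1;
    }
}
```

The skip table lets the window jump by up to the pattern length per step, which is where the speed-up over a character-by-character scan comes from.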
