The most efficient way to search for an array of strings in another string

礼貌的吻别 2021-02-12 15:12

I have a large array of strings that looks something like this: String temp[] = new String[200000].

I have another String, let's call it bigtext. What I ne

8 Answers
  • 2021-02-12 15:53

    I think you're looking for an algorithm like Rabin-Karp or Aho–Corasick, which are designed to search a text for a large number of substrings at once.
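
    To make the Aho–Corasick suggestion concrete, here is a minimal sketch, not a tuned implementation. The class name AhoCorasickSketch and the method findAll are hypothetical, and the sketch assumes the patterns are the entries of temp (null or empty entries are skipped). It builds a trie of all patterns, adds failure links, and then scans the text once.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration class; not part of the original question.
class AhoCorasickSketch {
    // One automaton state: outgoing edges, failure link, and the
    // indices of all patterns that end at this state.
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;
        final List<Integer> out = new ArrayList<>();
    }

    private final Node root = new Node();

    AhoCorasickSketch(String[] patterns) {
        // Phase 1: insert every pattern into a trie.
        for (int i = 0; i < patterns.length; i++) {
            if (patterns[i] == null || patterns[i].isEmpty()) continue;
            Node node = root;
            for (char c : patterns[i].toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.out.add(i);
        }
        // Phase 2: breadth-first pass that sets the failure links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(e.getKey())) f = f.fail;
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                // Inherit the patterns that end at the failure target.
                child.out.addAll(child.fail.out);
                queue.add(child);
            }
        }
    }

    /** Returns the indices of the patterns that occur somewhere in text. */
    Set<Integer> findAll(String text) {
        Set<Integer> found = new TreeSet<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Follow failure links until some state accepts c (or we hit the root).
            while (node != root && !node.next.containsKey(c)) node = node.fail;
            Node step = node.next.get(c);
            node = (step != null) ? step : root;
            found.addAll(node.out);
        }
        return found;
    }
}
```

    With this sketch, new AhoCorasickSketch(temp).findAll(bigtext) would return the indices into temp of the strings found in bigtext. The automaton is built once, so it pays off mainly when temp is fixed and many texts are scanned.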

  • 2021-02-12 16:00

    I'm afraid it's not efficient at all in any case!

    To pick the right algorithm, you need to provide some answers:

    1. What can be computed off-line? That is, is bigText known in advance? I guess temp is not, from its name.
    2. Are you actually searching for words? If so, index them. A Bloom filter can help, too.
    3. If you need a bit of fuzziness, maybe stemming or Soundex can do the job?

    Sticking to strict inclusion test, you might build a trie from your temp array. It would prevent searching the same sub-string several times.
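
    The trie idea can be sketched as follows. This is an illustration under assumptions: the class TrieScanSketch and the method contained are hypothetical names, and the entries of temp are assumed to be non-null. Patterns that share a prefix share a path in the trie, so that prefix is compared against bigtext only once per start position.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the original question.
class TrieScanSketch {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        // Indices into temp of the patterns that end at this node
        // (a list, so duplicate entries in temp are all reported).
        final List<Integer> ends = new ArrayList<>();
    }

    /** Returns one flag per pattern: true if temp[i] occurs in bigtext. */
    static boolean[] contained(String[] temp, String bigtext) {
        // Build the trie from all patterns.
        Node root = new Node();
        for (int i = 0; i < temp.length; i++) {
            Node node = root;
            for (char c : temp[i].toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.ends.add(i);
        }

        boolean[] hit = new boolean[temp.length];
        // Empty patterns end at the root and match trivially.
        for (int idx : root.ends) hit[idx] = true;

        // From every start position, walk the trie as far as the text allows.
        for (int start = 0; start < bigtext.length(); start++) {
            Node node = root;
            for (int i = start; i < bigtext.length(); i++) {
                node = node.next.get(bigtext.charAt(i));
                if (node == null) break; // no pattern continues with this character
                for (int idx : node.ends) hit[idx] = true;
            }
        }
        return hit;
    }
}
```

    The inner walk stops after at most as many steps as the longest pattern has characters, so the scan costs roughly the length of bigtext times the longest pattern length, independent of how many patterns there are.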

  • 2021-02-12 16:01

    That is a very efficient approach. You can improve it slightly by evaluating temp.length only once:

    for(int x = 0, len = temp.length; x < len; x++)
    

    Although you don't provide enough detail about your program, it's quite possible you could find a more efficient approach by redesigning it.

  • 2021-02-12 16:06

    Note that your current complexity is O(|S1|*n), where |S1| is the length of bigtext and n is the number of elements in your array, since each search is actually O(|S1|).

    By building a suffix tree from bigtext, and iterating on elements in the array, you could bring this complexity down to O(|S1| + |S2|*n), where |S2| is the length of the longest string in the array. Assuming |S2| << |S1|, it could be much faster!

    Building a suffix tree is O(|S1|), and each search is O(|S2|). A search does not have to go through bigtext; it only walks the relevant path of the suffix tree. Since this is done n times, you get a total of O(|S1| + n*|S2|), which is asymptotically better than the naive implementation.
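
    A linear-time suffix tree is long to write out, so the sketch below swaps in a suffix array, a related index over bigtext, purely as an illustration of "index the text once, then look up each pattern". It is not the suffix tree described above: the naive construction here sorts suffixes by direct comparison and is far from O(|S1|), and each lookup costs O(|S2| log |S1|). The names SuffixArraySketch and contains are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration class; a suffix array, not a suffix tree.
class SuffixArraySketch {
    private final String text;
    private final Integer[] suffixes; // start offsets of all suffixes, sorted lexicographically

    SuffixArraySketch(String text) {
        this.text = text;
        this.suffixes = new Integer[text.length()];
        for (int i = 0; i < suffixes.length; i++) suffixes[i] = i;
        // Naive construction: every comparison may scan many characters.
        Arrays.sort(suffixes, (a, b) -> compareSuffixes(a, b));
    }

    // Lexicographic comparison of the suffixes starting at a and b.
    private int compareSuffixes(int a, int b) {
        int n = text.length();
        while (a < n && b < n) {
            int d = text.charAt(a) - text.charAt(b);
            if (d != 0) return d;
            a++;
            b++;
        }
        // The suffix that ran out first is a prefix of the other and sorts first.
        return b - a;
    }

    /** Binary search for a suffix that starts with pattern. */
    boolean contains(String pattern) {
        int lo = 0, hi = suffixes.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = comparePrefix(suffixes[mid], pattern);
            if (cmp == 0) return true;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return pattern.isEmpty();
    }

    // Compares the suffix at 'start' with pattern over the first pattern.length() characters.
    private int comparePrefix(int start, String pattern) {
        for (int i = 0; i < pattern.length(); i++) {
            if (start + i >= text.length()) return -1; // suffix ended first, so it sorts before pattern
            int d = text.charAt(start + i) - pattern.charAt(i);
            if (d != 0) return d;
        }
        return 0;
    }
}
```

    Usage would be to build new SuffixArraySketch(bigtext) once and call contains(temp[x]) for each element, so no lookup rescans bigtext.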

  • 2021-02-12 16:08

    If you have additional information about temp, you may be able to improve the iteration.

    You can also reduce the time spent if you parallelize the iteration.
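
    One way to parallelize the iteration, shown here only as a sketch, is a parallel stream over temp. The class ParallelScanSketch and the method findContained are hypothetical names, and the entries of temp are assumed to be non-null. Each element is still checked with String.contains, so the total work is unchanged; it is only spread across cores.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration class; not part of the original question.
class ParallelScanSketch {
    /** Returns the elements of temp that occur in bigtext, in their original order. */
    static List<String> findContained(String[] temp, String bigtext) {
        return Arrays.stream(temp)
                .parallel()                 // split the array across worker threads
                .filter(bigtext::contains)  // same per-element check as the sequential loop
                .collect(Collectors.toList());
    }
}
```

    Whether this helps in practice depends on the number of cores and on how long bigtext is; for a short bigtext the overhead of splitting the work may outweigh the gain.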

  • 2021-02-12 16:14

    Use a search algorithm like Boyer-Moore. Search for "Boyer-Moore" and you will find many links that explain how it works. For instance, there is a Java example.
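
    As a rough idea of what such code looks like, here is a sketch of Boyer-Moore-Horspool, a simplified variant that keeps only the bad-character shift. It is not the full Boyer-Moore algorithm and not the Java example mentioned above. The class HorspoolSketch and its indexOf method are hypothetical names. Note that it searches for a single pattern, so it would still be called once per element of temp.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; Horspool variant, not full Boyer-Moore.
class HorspoolSketch {
    /** Returns the index of the first occurrence of pattern in text, or -1 if there is none. */
    static int indexOf(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        if (m == 0) return 0;
        if (m > n) return -1;

        // Bad-character table: for each character in pattern (except the last position),
        // the distance from its last occurrence to the end of the pattern.
        Map<Character, Integer> shift = new HashMap<>();
        for (int i = 0; i < m - 1; i++) {
            shift.put(pattern.charAt(i), m - 1 - i);
        }

        int pos = 0;
        while (pos <= n - m) {
            // Compare right to left within the current window.
            int j = m - 1;
            while (j >= 0 && text.charAt(pos + j) == pattern.charAt(j)) j--;
            if (j < 0) return pos;
            // Shift by the table entry for the text character under the pattern's last position;
            // characters that do not occur in the pattern allow a full shift of m.
            pos += shift.getOrDefault(text.charAt(pos + m - 1), m);
        }
        return -1;
    }
}
```

    The benefit over a plain left-to-right scan is that mismatches often let the window jump several characters at once, which matters most for long patterns.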
