Fastest way to find Strings in String collection that begin with certain chars

强颜欢笑 submitted on 2019-12-07 15:55:24

Question


I have a large collection of Strings. I want to be able to find the Strings that begin with "Foo" or the Strings that end with "Bar". What would be the best Collection type to get the fastest results? (I am using Java)

I know that a HashSet is very fast for complete matches, but I would think not for partial matches? So, what could I use instead of just looping through a List? Should I look into LinkedLists or similar types? Are there any Collection types that are optimized for this kind of query?


Answer 1:


The best collection type for this problem is SortedSet. You would need two of them in fact:

  1. Words in regular order.
  2. Words with their characters reversed.

Once these SortedSets have been created, you can use the subSet method to find what you are looking for. For example:

  1. Words starting with "Foo":

     forwardSortedSet.subSet("Foo","Fop");
    
  2. Words ending with "Bar":

     backwardSortedSet.subSet("raB","raC");
    

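The upper bound is the search string with its last character "incremented" by 1 ("Foo" becomes "Fop", "raB" becomes "raC"), so the range covers every word that starts with the search string. subSet excludes its upper bound, so the "ending" word itself ("Fop" or "raC") is never returned.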

EDIT: Of the two concrete classes that implement SortedSet in the standard Java library, use TreeSet. The other (ConcurrentSkipListSet) is oriented to concurrent programs and thus not optimized for this situation.
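
The approach above can be sketched as a small class. This is only an illustration, not the answerer's code: the class name PrefixSuffixIndex and its helper methods are hypothetical, and upperBound assumes a non-empty search string whose last character is not Character.MAX_VALUE.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper class: keeps the words in two TreeSets,
// one in regular order and one with every word reversed.
class PrefixSuffixIndex {
    private final TreeSet<String> forwardSortedSet = new TreeSet<>();
    private final TreeSet<String> backwardSortedSet = new TreeSet<>();

    PrefixSuffixIndex(Iterable<String> words) {
        for (String w : words) {
            forwardSortedSet.add(w);
            backwardSortedSet.add(reverse(w));
        }
    }

    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // "Foo" -> "Fop": the prefix with its last character incremented by one.
    // Assumes a non-empty prefix whose last char is not Character.MAX_VALUE.
    static String upperBound(String prefix) {
        int last = prefix.length() - 1;
        return prefix.substring(0, last) + (char) (prefix.charAt(last) + 1);
    }

    // View of all words that start with the prefix (upper bound excluded).
    SortedSet<String> startingWith(String prefix) {
        return forwardSortedSet.subSet(prefix, upperBound(prefix));
    }

    // Words that end with the suffix, reversed back to their normal form.
    SortedSet<String> endingWith(String suffix) {
        String reversed = reverse(suffix);
        SortedSet<String> result = new TreeSet<>();
        for (String s : backwardSortedSet.subSet(reversed, upperBound(reversed))) {
            result.add(reverse(s));
        }
        return result;
    }

    public static void main(String[] args) {
        PrefixSuffixIndex index = new PrefixSuffixIndex(
                Arrays.asList("Foo", "Food", "Fop", "FooBar", "crowBar"));
        System.out.println(index.startingWith("Foo"));
        System.out.println(index.endingWith("Bar"));
    }
}
```

Each lookup costs a logarithmic search for the range start plus the number of matches, at the price of storing every word twice.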




Answer 2:


It's been a while but I needed to implement this now and did some testing.

I already have a HashSet<String> as the source, so the time needed to build any other data structure is counted in the total duration. 100 different sources are used, and the data structures have to be rebuilt for each one. I only need to match a few single Strings each time. These tests ran on Android.

Methods:

  1. Simple loop through the HashSet, calling endsWith() on each String.

  2. Simple loop through the HashSet, performing a precompiled Pattern match (regex) on each String.

  3. Convert the HashSet to a single String joined by \n and run a single match against the whole String.

  4. Generate a sorted set (TreeSet) with the reversed Strings from the HashSet. Then match with subSet() as explained by @Mario Rossi.
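
For reference, methods 1 and 2 might look roughly like the sketch below. The benchmark code itself was not posted, so the class name LinearSuffixScan, the method names, and the exact regex (the quoted suffix anchored with \z) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of the two linear-scan methods; not the benchmarked code.
class LinearSuffixScan {

    // Method 1: visit every String and test it with endsWith().
    static List<String> byEndsWith(Collection<String> words, String suffix) {
        List<String> hits = new ArrayList<>();
        for (String w : words) {
            if (w.endsWith(suffix)) {
                hits.add(w);
            }
        }
        return hits;
    }

    // Method 2: compile the Pattern once, then reuse it for every String.
    // Assumed pattern: the literal suffix followed by end-of-input (\z).
    static List<String> byRegex(Collection<String> words, String suffix) {
        Pattern pattern = Pattern.compile(Pattern.quote(suffix) + "\\z");
        List<String> hits = new ArrayList<>();
        for (String w : words) {
            if (pattern.matcher(w).find()) {
                hits.add(w);
            }
        }
        return hits;
    }
}
```

Both methods need no setup and visit every element once per query, which matches the "data setup:0ms" figures reported for methods 1 and 2.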

Results:

Duration for method 1: 173ms (data setup:0ms search:173ms)
Duration for method 2: 6909ms (data setup:0ms search:6909ms)
Duration for method 3: 3026ms (data setup:2377ms search:649ms)
Duration for method 4: 2111ms (data setup:2101ms search:10ms)

Conclusion:

Searching a SortedSet (TreeSet) is extremely fast, much faster than looping through all the Strings. However, creating the structure takes a lot of time. Regexes are much slower than endsWith(), and joining hundreds of Strings into a single large String is even more of a bottleneck on Android/Java.

If only a few matches need to be made, you are better off looping through your collection. If you have many more matches to make, a sorted set can be very useful!




Answer 3:


If the list of words is stable (not many words are added or deleted), a very good second alternative is to create 2 lists:

  1. One with the words in normal order.
  2. The second with the characters in each word reversed.

For speed, make them ArrayLists. Never use LinkedLists or other variants that perform extremely badly on random access (the core of binary search; see below).

After the lists are created, they can be sorted with method Collections.sort (only once each) and then searched with Collections.binarySearch. For example:

    Collections.sort(forwardList);
    Collections.sort(backwardList);

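And then to search for words starting with "Foo" (binarySearch returns -(insertion point) - 1 when "Foo" itself is not in the list, so a negative result has to be converted back first):

    int i = Collections.binarySearch(forwardList, "Foo");
    if (i < 0) {
        i = -i - 1; // "Foo" not present: start at its insertion point
    }
    while (i < forwardList.size() && forwardList.get(i).startsWith("Foo")) {
        // Process String forwardList.get(i)
        i++;
    }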

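And words ending with "Bar" (again converting a negative result to the insertion point):

    int i = Collections.binarySearch(backwardList, "raB");
    if (i < 0) {
        i = -i - 1; // "raB" not present: start at its insertion point
    }
    while (i < backwardList.size() && backwardList.get(i).startsWith("raB")) {
        // Process String backwardList.get(i)
        i++;
    }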
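
Putting the two sorted lists together, a self-contained version could look like the sketch below. The class name SortedListIndex and its methods are hypothetical; the sketch also steps back over duplicate entries, since Collections.binarySearch does not guarantee which of several equal elements it finds.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical wrapper around the two sorted ArrayLists described above.
class SortedListIndex {
    private final List<String> forwardList = new ArrayList<>();
    private final List<String> backwardList = new ArrayList<>();

    SortedListIndex(Collection<String> words) {
        for (String w : words) {
            forwardList.add(w);
            backwardList.add(reverse(w));
        }
        Collections.sort(forwardList);   // sorted only once each
        Collections.sort(backwardList);
    }

    private static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // Contiguous run of entries in a sorted list that start with the prefix.
    private static List<String> prefixRun(List<String> sorted, String prefix) {
        int start = Collections.binarySearch(sorted, prefix);
        if (start < 0) {
            start = -start - 1; // not found: convert to the insertion point
        } else {
            // Found: step back over any duplicates of the prefix itself.
            while (start > 0 && sorted.get(start - 1).equals(prefix)) {
                start--;
            }
        }
        int end = start;
        while (end < sorted.size() && sorted.get(end).startsWith(prefix)) {
            end++;
        }
        return sorted.subList(start, end);
    }

    List<String> startingWith(String prefix) {
        return new ArrayList<>(prefixRun(forwardList, prefix));
    }

    List<String> endingWith(String suffix) {
        List<String> result = new ArrayList<>();
        for (String reversed : prefixRun(backwardList, reverse(suffix))) {
            result.add(reverse(reversed)); // restore the normal word order
        }
        return result;
    }
}
```

Compared with the TreeSet variant, the lists use less memory per element, but every insertion or deletion means keeping both lists sorted, which is why this fits a stable word list best.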


Source: https://stackoverflow.com/questions/18564744/fastest-way-to-find-strings-in-string-collection-that-begin-with-certain-chars
