Better way to detect if a string contains multiple words

问题

I am trying to create a program that detects if multiple words are in a string as fast as possible, and if so, executes a behavior. Preferably, I would like it to detect the order of these words too but only if this can be done fast. So far, this is what I have done:

if (input.contains("adsf") && input.contains("qwer")) {
    execute();          
}

As you can see, doing this for multiple words would become tiresome. Is this the only way or is there a better way of detecting multiple substrings? And is there any way of detecting order?

回答1:

Editors note: Despite being heavily upvoted and accepted, this does not function the same as the code in the question. execute is called on the first match, like a logical OR.

You could use an array:

String[] matches = new String[] {"adsf", "qwer"};

bool found = false;
for (String s : matches)
{
  if (input.contains(s))
  {
    execute();
    break;
  }
}

This is efficient as the one posted by you but more maintainable. Looking for a more efficient solution sounds like a micro optimization that should be ignored until proven to be effectively a bottleneck of your code, in any case with a huge string set the solution could be a trie.

回答2:

I'd create a regular expression from the words:

Pattern pattern = Pattern.compile("(?=.*adsf)(?=.*qwer)");
if (pattern.matcher(input).find()) {
    execute();
}

For more details, see this answer: https://stackoverflow.com/a/470602/660143

回答3:

In Java 8 you could do

public static boolean containsWords(String input, String[] words) {
    return Arrays.stream(words).allMatch(input::contains);
}

Sample usage:

String input = "hello, world!";
String[] words = {"hello", "world"};
if (containsWords(input, words)) System.out.println("Match");

回答4:

If you have a lot of substrings to look up, then a regular expression probably isn't going to be much help, so you're better off putting the substrings in a list, then iterating over them and calling input.indexOf(substring) on each one. This returns an int index of where the substring was found. If you throw each result (except -1, which means that the substring wasn't found) into a TreeMap (where index is the key and the substring is the value), then you can retrieve them in order by calling keys() on the map.

Map<Integer, String> substringIndices = new TreeMap<Integer, String>();
List<String> substrings = new ArrayList<String>();
substrings.add("asdf");
// etc.

for (String substring : substrings) {
  int index = input.indexOf(substring);

  if (index != -1) {
    substringIndices.put(index, substring);
  }
}

for (Integer index : substringIndices.keys()) {
  System.out.println(substringIndices.get(index));
}

回答5:

Use a tree structure to hold the substrings per codepoint. This eliminates the need to

Note that this is efficient only if the needle set is almost constant. It is not inefficient if there are individual additions or removals of substrings though, but a different initialization each time to arrange a lot of strings into a tree structure would definitely slower it.

`StringSearcher`:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.HashMap;

class StringSearcher{
    private NeedleTree needles = new NeedleTree(-1);
    private boolean caseSensitive;
    private List<Integer> lengths = new ArrayList<>();
    private int maxLength;

    public StringSearcher(List<String> inputs, boolean caseSensitive){
        this.caseSensitive = caseSensitive;
        for(String input : inputs){
            if(!lengths.contains(input.length())){
                lengths.add(input.length());
            }
            NeedleTree tree = needles;
            for(int i = 0; i < input.length(); i++){
                tree = tree.child(caseSensitive ? input.codePointat(i) : Character.toLowerCase(input.codePointAt(i)));
            }
            tree.markSelfSet();
        }
        maxLength = Collections.max(legnths);
    }

    public boolean matches(String haystack){
        if(!caseSensitive){
            haystack = haystack.toLowerCase();
        }
        for(int i = 0; i < haystack.length(); i++){
            String substring = haystack.substring(i, i + maxLength); // maybe we can even skip this and use from haystack directly?
            NeedleTree tree = needles;
            for(int j = 0; j < substring.maxLength; j++){
                tree = tree.childOrNull(substring.codePointAt(j));
                if(tree == null){
                    break;
                }
                if(tree.isSelfSet()){
                    return true;
                }
            }
        }
        return false;
    }
}

`NeedleTree.java`:

import java.util.HashMap;
import java.util.Map;

class NeedleTree{
    private int codePoint;
    private boolean selfSet;
    private Map<Integer, NeedleTree> children = new HashMap<>();

    public NeedleTree(int codePoint){
        this.codePoint = codePoint;
    }

    public NeedleTree childOrNull(int codePoint){
        return children.get(codePoint);
    }

    public NeedleTree child(int codePoint){
        NeedleTree child = children.get(codePoint);
        if(child == null){
            child = children.put(codePoint, new NeedleTree(codePoint));
        }
        return child;
    }

    public boolean isSelfSet(){
        return selfSet;
    }

    public void markSelfSet(){
        selfSet = true;
    }
}

回答6:

This is a classical interview and CS problem.

Robin Karp algorithm is usually what people first talk about in interviews. The basic idea is that as you go through the string, you add the current character to the hash. If the hash matches the hash of one of your match strings, you know that you might have a match. This avoids having to scan back and forth into your match strings. https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm

Other typical topics for that interview question are to consider a trie structure to speed up the lookup. If you have a large set of match strings, you have to always check a large set of match strings. A trie structure is more efficient to do that check. https://en.wikipedia.org/wiki/Trie

Additional algorithms are: - Aho–Corasick https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm - Commentz-Walter https://en.wikipedia.org/wiki/Commentz-Walter_algorithm

回答7:

I think a better approach would be something like this, where we can add multiple values as a one string and by index of function validate index

String s = "123"; 
System.out.println(s.indexOf("1")); // 0
System.out.println(s.indexOf("2")); // 1 
System.out.println(s.indexOf("5")); // -1

来源：https://stackoverflow.com/questions/18885043/better-way-to-detect-if-a-string-contains-multiple-words

标签

java

string

substring

contains