Java: Matching Phrases in a String

Question

I have a list of phrases in a database (a phrase may consist of one or more words) and an input string. I need to find out which of those phrases appear in the input string.

Is there an efficient way to perform such matching in Java?

Answer 1:

A quick hack would be:

Build a regexp based on the combined phrases
Construct a set listing the phrases that haven't matched so far
Repeatedly run find until all phrases have been found or end of input is reached, removing matches from the set of remaining phrases to find

That way, the input is traversed only once, regardless of how many phrases you provide. If the regexp compiler generates an efficient matcher for multiple alternatives, this should yield decent performance. However, it depends a lot on your phrases and input string, as well as on the quality of the Java regexp engine.

Sample code (tested, but not optimized or profiled for performance):

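import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static boolean hasAllPhrasesInInput(List<String> phrases, String input) {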
    Set<String> phrasesToFind = new HashSet<String>();
    StringBuilder sb = new StringBuilder();
    for (String phrase : phrases) {
        if (sb.length() > 0) {
            sb.append('|');
        }
        sb.append(Pattern.quote(phrase));
        phrasesToFind.add(phrase.toLowerCase());
    }
    Pattern pattern = Pattern.compile(sb.toString(), Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(input);
    while (matcher.find()) {
        phrasesToFind.remove(matcher.group().toLowerCase());
        if (phrasesToFind.isEmpty()) {
            return true;
        }
    }
    return false;
}

Some caveats:

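The code above will match phrases as substrings of words. If only complete words should match, you will need to add word boundaries (\b, written as "\\b" in a Java string literal) to the generated regexp.
The code must be modified if some phrases may be substrings of other phrases or may overlap in the input. The alternation reports only the first alternative that matches at a given position, and find() resumes after the end of the previous match, so a phrase contained in or overlapping an already matched phrase may never be reported.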
If you need to match non-ASCII text, you should add the regexp option Pattern.UNICODE_CASE and call toLowerCase(Locale) instead of toLowerCase(), using a suitable Locale.

Answer 2:

Here is a solution using Java. As you have not specified anything about the strings you use, I will consider a generic example:

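Pattern p = Pattern.compile("cat");
// Create a matcher for the input string
Matcher m = p.matcher("one cat, two cats in the yard");
// find() looks for the pattern anywhere in the input; matches() would
// return false here because it requires the whole input to match
boolean b = m.find();  // true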

Hope that helps

Reference: http://java.sun.com/developer/technicalArticles/releases/1.4regex/
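
Since Matcher offers both matches() and find(), it may help to see the two side by side. This is a small sketch; the class and method names (MatchesVsFind, wholeInputMatches, containsMatch) are invented for the example. matches() succeeds only if the entire input matches the pattern, while find() succeeds if the pattern occurs anywhere in the input, which is what phrase lookup needs.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
final class MatchesVsFind {

    // True only if the whole input matches the regex.
    static boolean wholeInputMatches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True if the regex matches some part of the input.
    static boolean containsMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }
}
```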

Answer 3:

You can organize the search phrases from your database into a tree based on their common beginnings (a prefix tree, or trie). Then you can analyze your string character by character, trying to match it against the nodes of that tree.
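
A rough sketch of that idea follows. The names (PhraseTrie, findIn) are invented for the example, and it makes simplifying assumptions: matching is case-sensitive, phrases match as plain substrings (no word boundaries), and the walk restarts at every input position rather than using failure links as the Aho-Corasick algorithm would.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical class illustrating the prefix-tree approach.
final class PhraseTrie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String phrase; // non-null if a phrase ends at this node
    }

    private final Node root = new Node();

    // Builds the tree once; phrases sharing a beginning share nodes.
    PhraseTrie(List<String> phrases) {
        for (String phrase : phrases) {
            if (phrase.isEmpty()) {
                continue;
            }
            Node node = root;
            for (int i = 0; i < phrase.length(); i++) {
                node = node.children.computeIfAbsent(phrase.charAt(i), c -> new Node());
            }
            node.phrase = phrase;
        }
    }

    // Returns every stored phrase that occurs somewhere in input.
    Set<String> findIn(String input) {
        Set<String> found = new LinkedHashSet<>();
        for (int start = 0; start < input.length(); start++) {
            Node node = root;
            // Walk down the tree for as long as the input keeps matching.
            for (int i = start; i < input.length(); i++) {
                node = node.children.get(input.charAt(i));
                if (node == null) {
                    break;
                }
                if (node.phrase != null) {
                    found.add(node.phrase);
                }
            }
        }
        return found;
    }
}
```

Because each start position is walked independently, phrases that contain or overlap one another are all reported. The cost is roughly the input length times the longest phrase length; the tree is built once and can be reused across inputs.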

Answer 4:

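// Select the phrases that occur inside the input string.
// JDBC uses ? as the parameter placeholder. CONCAT is an assumption here:
// string concatenation syntax varies by database (some use || instead).
String sql = "SELECT phrase " +
  " FROM phrases " +
  " WHERE ? LIKE CONCAT('%', phrase, '%')";
PreparedStatement pstmt = conn.prepareStatement(sql);
// probably repeated, if more than one input:
pstmt.setString(1, input);
ResultSet rs = pstmt.executeQuery();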

A prepared statement is checked against the database once and is faster for repeated invocation, so if you have more than one input, executing it in a loop should still be fast.

Of course, you could also load all your phrases into RAM, for example into a map. That is slow to set up, but it might be faster if you have many inputs rather than just one. That said, databases are often quite efficient at searching.

Source: https://stackoverflow.com/questions/6036192/java-matching-phrases-in-a-string

Tags

java

database

matching

phrase