RegEx that matches only if a string contains a word from each list

Submitted by 筅森魡賤 on 2020-01-05 04:32:27

Question


I'm developing software that has to check whether a text contains a word from one specified list and a word from another specified list.

Example:

list 1: dog, cat
list 2: house, tree

The following texts have to match:

the dog is in the house -> contains dog and house
my house is full of dogs -> contains dog and house
the cat is on the tree -> contains cat and tree

The following examples must not match:

the frog is in the house -> there is no word from the first list
Boby is the name of my dog -> there is no word from the second list
Outside my house there is a tree -> there is no word from the first list

To solve the problem quickly, I've made a list of patterns like:

dog.*house, house.*dog, cat.*house, ...

but I'm pretty sure there is a smarter way...


Answer 1:


You can use an alternation (|) for each of the sets of alternatives, and a wrapper alternation for the order. So:

(?:(?:dog|cat).*(?:house|tree))|(?:(?:house|tree).*(?:dog|cat))

JavaScript Example (non-capturing groups and alternations work the same in Java and JavaScript):

var tests = [
    {match: true,  text: "the dog is in the house -> contains dog and house"},
    {match: true,  text: "my house is full of dogs -> contains dog and house"},
    {match: true,  text: "the cat is on the tree -> contains cat and tree"},
    {match: false, text: "the frog is in the house -> there is no word from the first list"},
    {match: false, text: "Boby is the name of my dog -> there is no word from the second list"},
    {match: false, text: "Outside my house there is a tree -> there is no word from the first list"}
];
var rex = /(?:(?:dog|cat).*(?:house|tree))|(?:(?:house|tree).*(?:dog|cat))/;
tests.forEach(function(test) {
  var result = rex.test(test.text);
  if (!!result == !!test.match) {
    console.log('GOOD: "' + test.text + '": ' + result);
  } else {
    console.log('BAD: "' + test.text + '": ' + result + ' (expected ' + test.match + ')');
  }
});

Note that in the above we're not checking for words, just sequences of letters. If you want it to be actual words, you'll need to add word break assertions or similar. Left as an exercise to the reader...
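One way to approach that exercise in Java, as a minimal sketch: put `\b` around each alternation and allow an optional plural `s`. The optional `s` is an assumption made here so that "dogs" still counts as "dog", as the question's second example requires; the class name `WordBoundaryDemo` and the method `containsFromBoth` are made up for illustration.

```java
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // A whole word from list 1 / list 2, with an optional plural "s" (assumption).
    private static final String A = "\\b(?:dog|cat)s?\\b";
    private static final String B = "\\b(?:house|tree)s?\\b";

    // Either order: A ... B, or B ... A.
    private static final Pattern BOTH =
        Pattern.compile("(?:" + A + ".*" + B + ")|(?:" + B + ".*" + A + ")");

    public static boolean containsFromBoth(String text) {
        // find() searches anywhere in the text, so no leading/trailing ".*" is needed.
        return BOTH.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsFromBoth("the dog is in the house"));
        System.out.println(containsFromBoth("my house is full of dogs"));
        // "doghouse" and "treehouse" are not whole-word matches for any list entry.
        System.out.println(containsFromBoth("the doghouse is in a treehouse"));
    }
}
```

With these boundaries, "doghouse" no longer counts as containing "dog" or "house", which the letter-sequence version above would accept.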




Answer 2:


Here is a solution that works for any number of lists, with any number of words in them.

Regular expressions scan linear sequences. However, you are asking two independent yes/no questions whose answers do not depend on the order in which the patterns occur. Without lookahead assertions, you therefore have to enumerate all permutations of the sub-patterns. For a small number of lists, this can be done manually, as demonstrated in the other answer. Below is a solution for the general case.

You certainly don't want to write the regex out manually, so here is a Java program that does what you want:

import java.util.*;
import java.util.stream.*;
import static java.util.Arrays.asList;
import static java.util.stream.Collectors.toList;

public class RegexWithPermutations {

    /** Build a regex that checks whether
      * a string contains one of the words.
      */
    public static String containsWordRegex(List<String> words) {
      StringBuilder sb = new StringBuilder();
      boolean first = true;
      for (String w: words) {
        if (!first) {
          sb.append("|");
        }
        sb.append("(?:" + w + ")");
        first = false;
      }
      return sb.toString();
    }

    /** Generates all permutations of regexes.
      */
    public static String allRegexPermutations(
      final List<String> regexes,
      final String separator
    ) {
      class PermutationHelper {
        /** Deletes one element from the array */
        private int[] remove(int[] arr, int idx) {
          int n = arr.length;
          int[] res = new int[n - 1];
          System.arraycopy(arr, 0, res, 0, idx);
          System.arraycopy(arr, idx + 1, res, idx, n - idx - 1);
          return res;
        }

        /** Helper method that generates all permutations combined with "|".
          */
        public List<String> rec(String suffix, int[] unusedIndices) {
          if (unusedIndices.length == 1) {
            return asList(regexes.get(unusedIndices[0]) + suffix);
          } else {
            return IntStream.range(0, unusedIndices.length)
              .boxed()
              .<String>flatMap(i -> rec(
                separator + regexes.get(unusedIndices[i]) + suffix, // keep the part of the permutation built so far
                remove(unusedIndices, i)
              ).stream())
              .collect(toList());
          }
        }
      }
      int[] startIndices = new int[regexes.size()];
      for (int i = 0; i < regexes.size(); i++) {
        startIndices[i] = i;
      }
      List<String> ps = (new PermutationHelper()).rec("", startIndices);
      StringBuilder b = new StringBuilder();
      boolean first = true;
      for (String p : ps) {
        if (!first) {
          b.append("|");
        }
        b.append(p);
        first = false;
      }
      return b.toString();
    }

    public static void main(String[] args) {
      List<String> list_1 = asList("dog", "cat");
      List<String> list_2 = asList("house", "tree");  

      List<String> examples = asList(
        "the dog is in the house",
        "my house is full of dogs",
        "the cat is on the tree",
        "the frog is in the house",
        "Boby is the name of my dog",
        "Outside my house there is a tree"
      );

      String regex = ".*(?:" + allRegexPermutations(asList(
        "(?:" + containsWordRegex(list_1) + ")",
        "(?:" + containsWordRegex(list_2) + ")"
      ), ".*") + ").*";

      System.out.println("Constructed regex: " + regex);

      for (String example: examples) {
        System.out.println(example + " -> " + example.matches(regex));
      }
    }
}

Output:

    Constructed regex: .*(?:(?:(?:house)|(?:tree)).*(?:(?:dog)|(?:cat))|(?:(?:dog)|(?:cat)).*(?:(?:house)|(?:tree))).*
    the dog is in the house -> true
    my house is full of dogs -> true
    the cat is on the tree -> true
    the frog is in the house -> false
    Boby is the name of my dog -> false
    Outside my house there is a tree -> false

It works with arbitrary numbers of lists, but the length of the regex grows factorially (n lists produce n! alternatives), so it's not advisable to use it for more than about 3 to 5 lists.
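If the regex engine supports lookahead assertions (Java's `java.util.regex` does), the blow-up can be sidestepped with one `(?=.*(?:...))` per list, so the pattern grows linearly with the number of lists. This is a different technique from the permutation builder above, not part of the original answer; it is only a sketch. `LookaheadDemo`, `buildRegex` and `containsFromEach` are hypothetical names, and the use of `Pattern.quote` assumes the list entries are literal words rather than sub-patterns.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class LookaheadDemo {
    /** Builds ^(?=.*(?:w1|w2))(?=.*(?:w3|w4))... with one lookahead per list. */
    public static Pattern buildRegex(List<List<String>> lists) {
        StringBuilder sb = new StringBuilder("^");
        for (List<String> words : lists) {
            String alternation = words.stream()
                .map(Pattern::quote)                 // treat each word literally
                .collect(Collectors.joining("|"));
            sb.append("(?=.*(?:").append(alternation).append("))");
        }
        // DOTALL lets ".*" cross line breaks in multi-line texts.
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    public static boolean containsFromEach(Pattern p, String text) {
        // All lookaheads are checked from the start of the text, in any order.
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        Pattern p = buildRegex(Arrays.asList(
            Arrays.asList("dog", "cat"),
            Arrays.asList("house", "tree")
        ));
        for (String example : Arrays.asList(
                "the dog is in the house",
                "the frog is in the house",
                "Outside my house there is a tree")) {
            System.out.println(example + " -> " + containsFromEach(p, example));
        }
    }
}
```

Like the answers above, this checks for sequences of letters rather than whole words; word-boundary assertions would have to be added inside each lookahead.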



Source: https://stackoverflow.com/questions/48563933/regex-that-matches-only-if-a-string-contains-a-word-from-each-list
