Question
I'm developing software that has to check whether a text contains a word from one specified list and a word from another specified list.
Example:
list 1: dog, cat
list 2: house, tree
The following texts have to match:
the dog is in the house -> contains dog and house
my house is full of dogs -> contains dog and house
the cat is on the tree -> contains cat and tree
The following examples must not match:
the frog is in the house -> there is no word from the first list
Boby is the name of my dog -> there is no word from the second list
Outside my house there is a tree -> there is no word from the first list
To solve the problem quickly, I've made a list of patterns like:
dog.*house, house.*dog, cat.*house, ...
but I'm pretty sure there is a smarter way...
Answer 1:
You can use an alternation (|) for each of the sets of alternatives, and a wrapper alternation for the order. So:
(?:(?:dog|cat).*(?:house|tree))|(?:(?:house|tree).*(?:dog|cat))
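For reference, here is a minimal sketch of how that pattern might be applied in Java. The class name BothListsDemo and the method containsBoth are made up for illustration. It assumes you want a match anywhere in the text, so it uses Matcher.find() rather than String.matches().

```java
import java.util.regex.Pattern;

public class BothListsDemo {
    // Same pattern as above: a list-1 word followed by a list-2 word, or the reverse order.
    static final Pattern BOTH = Pattern.compile(
        "(?:(?:dog|cat).*(?:house|tree))|(?:(?:house|tree).*(?:dog|cat))");

    /** Hypothetical helper: true if the text contains a word from each list, in either order. */
    static boolean containsBoth(String text) {
        // find() searches for the pattern anywhere in the text
        return BOTH.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] examples = {
            "the dog is in the house",
            "my house is full of dogs",
            "the cat is on the tree",
            "the frog is in the house",
            "Boby is the name of my dog",
            "Outside my house there is a tree"
        };
        for (String example : examples) {
            System.out.println(example + " -> " + containsBoth(example));
        }
    }
}
```

By default "." does not match line terminators in Java, so for multi-line text you would probably want to compile the pattern with Pattern.DOTALL.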
JavaScript Example (non-capturing groups and alternations work the same in Java and JavaScript):
var tests = [
{match: true, text: "the dog is in the house -> contains dog and house"},
{match: true, text: "my house is full of dogs -> contains dog and house"},
{match: true, text: "the cat is on the tree -> contains cat and tree"},
{match: false, text: "the frog is in the house -> there is no word from the first list"},
{match: false, text: "Boby is the name of my dog -> there is no word from the second list"},
{match: false, text: "Outside my house there is a tree -> there is no word from the first list"}
];
var rex = /(?:(?:dog|cat).*(?:house|tree))|(?:(?:house|tree).*(?:dog|cat))/;
tests.forEach(function(test) {
var result = rex.test(test.text);
if (!!result == !!test.match) {
console.log('GOOD: "' + test.text + '": ' + result);
} else {
console.log('BAD: "' + test.text + '": ' + result + ' (expected ' + test.match + ')');
}
});
Note that the above doesn't check for whole words, just sequences of letters. If you want it to match actual words, you'll need to add word boundary assertions (\b) or similar. Left as an exercise for the reader...
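One possible way to do that exercise, sketched in Java: wrap each alternation in \b assertions. Because the question wants "dogs" to count as "dog", this sketch also allows an optional trailing "s". That plural rule is an assumption, not something the question specifies, and the names WordBoundaryDemo and containsBothWords are made up for illustration.

```java
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // \b anchors each word at word boundaries; "s?" is an assumed, simplistic plural rule.
    static final String FIRST = "\\b(?:dog|cat)s?\\b";
    static final String SECOND = "\\b(?:house|tree)s?\\b";

    // Either order: FIRST ... SECOND, or SECOND ... FIRST.
    static final Pattern BOTH = Pattern.compile(
        "(?:" + FIRST + ".*" + SECOND + ")|(?:" + SECOND + ".*" + FIRST + ")");

    /** Hypothetical helper: true if the text contains a whole word from each list. */
    static boolean containsBothWords(String text) {
        return BOTH.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] examples = {
            "my house is full of dogs",      // plural still counts under the assumed rule
            "the hotdog is in the house",    // "dog" inside "hotdog" is not a whole word
            "a catalog of tree houses"       // "cat" inside "catalog" is not a whole word
        };
        for (String example : examples) {
            System.out.println(example + " -> " + containsBothWords(example));
        }
    }
}
```

With this variant, "hotdog" and "catalog" no longer count as containing a word from the first list.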
Answer 2:
Here is a solution that works for any number of lists, with any number of words in them.
Regexes scan linear sequences. However, you are asking two questions, each of which can be true or false irrespective of the order of the patterns. Therefore, you must enumerate all permutations of regex combinations. For a small number of lists, this can be done manually, as demonstrated in the other answer. Below is the solution for the general case.
You certainly don't want to write the regex out manually, so here is a Java program that does what you want:
import java.util.*;
import java.util.stream.*;
import static java.util.Arrays.asList;
import static java.util.stream.Collectors.toList;
public class RegexWithPermutations {
/** Build a regex that checks whether
* a string contains one of the words.
*/
public static String containsWordRegex(List<String> words) {
StringBuilder sb = new StringBuilder();
boolean first = true;
for (String w: words) {
if (!first) {
sb.append("|");
}
sb.append("(?:" + w + ")");
first = false;
}
return sb.toString();
}
/** Generates all permutations of regexes.
*/
public static String allRegexPermutations(
final List<String> regexes,
final String separator
) {
class PermutationHelper {
/** Deletes one element from the array */
private int[] remove(int[] arr, int idx) {
int n = arr.length;
int[] res = new int[n - 1];
System.arraycopy(arr, 0, res, 0, idx);
System.arraycopy(arr, idx + 1, res, idx, n - idx - 1);
return res;
}
/** Helper method that generates all permutations combined with "|".
*/
public List<String> rec(String suffix, int[] unusedIndices) {
if (unusedIndices.length == 1) {
return asList(regexes.get(unusedIndices[0]) + suffix);
} else {
return IntStream.range(0, unusedIndices.length)
.boxed()
.<String>flatMap(i -> rec(
separator + regexes.get(unusedIndices[i]) + suffix, // keep the suffix built so far; otherwise regexes are dropped with 3+ lists
remove(unusedIndices, i)
).stream())
.collect(toList());
}
}
}
int[] startIndices = new int[regexes.size()];
for (int i = 0; i < regexes.size(); i++) {
startIndices[i] = i;
}
List<String> ps = (new PermutationHelper()).rec("", startIndices);
StringBuilder b = new StringBuilder();
boolean first = true;
for (String p : ps) {
if (!first) {
b.append("|");
}
b.append(p);
first = false;
}
return b.toString();
}
public static void main(String[] args) {
List<String> list_1 = asList("dog", "cat");
List<String> list_2 = asList("house", "tree");
List<String> examples = asList(
"the dog is in the house",
"my house is full of dogs",
"the cat is on the tree",
"the frog is in the house",
"Boby is the name of my dog",
"Outside my house there is a tree"
);
String regex = ".*(?:" + allRegexPermutations(asList(
"(?:" + containsWordRegex(list_1) + ")",
"(?:" + containsWordRegex(list_2) + ")"
), ".*") + ").*";
System.out.println("Constructed regex: " + regex);
for (String example: examples) {
System.out.println(example + " -> " + example.matches(regex));
}
}
}
Output:
Constructed regex: .*(?:(?:(?:house)|(?:tree)).*(?:(?:dog)|(?:cat))|(?:(?:dog)|(?:cat)).*(?:(?:house)|(?:tree))).*
the dog is in the house -> true
my house is full of dogs -> true
the cat is on the tree -> true
the frog is in the house -> false
Boby is the name of my dog -> false
Outside my house there is a tree -> false
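Note that containsWordRegex inserts each word into the regex verbatim, so a word containing regex metacharacters (for example "c++") would change the pattern's meaning. A minimal sketch of a variant that quotes each word with Pattern.quote follows. The class name QuotedAlternation and the helper containsAny are made up for illustration, and unlike the original method this variant wraps the whole alternation in a single non-capturing group.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class QuotedAlternation {
    /** Joins the words into one alternation, quoting each so it is matched literally. */
    static String containsWordRegex(List<String> words) {
        return words.stream()
            .map(Pattern::quote)                           // e.g. c++ becomes \Qc++\E
            .collect(Collectors.joining("|", "(?:", ")")); // wrap in one non-capturing group
    }

    /** Hypothetical helper: true if the text contains any of the words literally. */
    static boolean containsAny(List<String> words, String text) {
        return Pattern.compile(containsWordRegex(words)).matcher(text).find();
    }

    public static void main(String[] args) {
        List<String> words = java.util.Arrays.asList("c++", "dog");
        System.out.println(containsWordRegex(words));
        System.out.println(containsAny(words, "I like c++"));
    }
}
```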
This approach works with an arbitrary number of lists, but the number of alternatives in the regex grows factorially (n! orderings for n lists), so it's not advisable to use it for more than about 3 to 5 lists.
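To get a feel for that growth, here is a small sketch that counts how many orderings, and therefore how many top-level alternatives, the generated regex contains for n lists. The class and method names are made up for illustration.

```java
public class PermutationCount {
    /** Number of orderings of n lists, i.e. n! alternatives in the generated regex. */
    static long alternatives(int n) {
        long result = 1;
        for (int i = 2; i <= n; i++) {
            result *= i;
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints the counts 2, 6, 24, 120, 720 for n = 2..6
        for (int n = 2; n <= 6; n++) {
            System.out.println(n + " lists -> " + alternatives(n) + " alternatives");
        }
    }
}
```

Each alternative itself contains all n sub-regexes, so the total regex length grows even faster than the count alone suggests.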
Source: https://stackoverflow.com/questions/48563933/regex-that-matches-only-if-a-string-contains-a-word-from-each-list