I want a PCRE regex to create bigram pairings similar to this question, but without duplicate words.
Full Match: apple orange plum
Group 1: apple orange
Group 2: orange plum
The closest I’ve gotten to it is this, but ‘orange’ isn’t captured in the second group.
(\b.+\b)(\g<1>)\b
You're looking for this:
/(?=(\b\w+\s+\w+))/g
Here's a quick perl one-liner to demonstrate it:
$ perl -e 'while ("apple orange plum" =~ /(?=(\b\w+\s+\w+))/g) { print "$1\n" }'
apple orange
orange plum
This uses a zero-width lookahead (?=…) around the capture group to ensure we can read the word "orange" twice.
If we used /(\b\w+\s+\w+)/g instead, we'd get "apple orange" but not the second match, because the left-to-right processing of the regular expression would have already passed over the word "orange".
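To see the difference side by side, here is a minimal sketch in Python rather than Perl. It assumes Python's built-in re module treats the lookahead, \b, \w and \s the same way PCRE does for this input; the variable names are only for illustration.

```python
import re

text = "apple orange plum"

# Lookahead version: every match is zero-width, so the scan moves forward
# one character at a time and "orange" can begin a second pair.
overlapping = re.findall(r"(?=(\b\w+\s+\w+))", text)

# Consuming version: the first match swallows "apple orange", and the scan
# resumes after "orange", where no complete pair is left to find.
consuming = re.findall(r"(\b\w+\s+\w+)", text)

print(overlapping)  # ['apple orange', 'orange plum']
print(consuming)    # ['apple orange']
```

Because the lookahead consumes nothing, the captured text comes from group 1, not from the overall match, which is empty.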
If we omitted the word boundary \b, the regex engine would give us "apple orange" and then "pple orange", "ple orange", etc., including "orange plum" later on, but also "range plum" through "e plum", since those all satisfy the pattern.
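The effect of dropping \b can be checked the same way. This is a rough sketch in Python's re module (an assumption: it stands in for PCRE here, and the two should agree on this simple pattern), with an illustrative variable name.

```python
import re

text = "apple orange plum"

# Without \b, the lookahead succeeds at every position that starts with a
# word character followed later by whitespace and another word, so each
# suffix of "apple" and of "orange" produces its own pair.
no_boundary = re.findall(r"(?=(\w+\s+\w+))", text)

for pair in no_boundary:
    print(pair)
```

The list starts with "apple orange" and "pple orange" and ends with "e plum", which is why the anchor is needed to keep only whole-word pairs.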
Source: https://stackoverflow.com/questions/54279023/regex-non-duplicate-bigrams