I have a data frame in which I need to find every row that matches any of the keywords in terms. My code is:
texts = ['foo abc', 'foobar xyz',
The longer alternatives should come before the shorter ones, so you need to sort the keywords by length in descending order:
pat = r'\b(?:{})\b'.format('|'.join(sorted(terms,key=len,reverse=True)))
The result will be the \b(?:foo baz|foo|baz)\b pattern. It will first try to match foo baz, then foo, then baz. If foo baz is found, the match is returned and the search for the next match resumes from the end of that match, so you won't match the foo or baz already consumed by the previous match again.
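The ordering effect can be checked with a small sketch. The terms list here is an assumption inferred from the pattern shown above, and the sample texts are hypothetical, made up only for illustration:

```python
import re

# Assumed keyword list, inferred from the pattern in the answer.
terms = ['foo', 'baz', 'foo baz']
# Hypothetical sample texts, for illustration only.
texts = ['foo baz qux', 'foo abc', 'foobar xyz', 'baz foo']

# Longest alternatives first, so 'foo baz' is tried before 'foo' and 'baz'.
pat = r'\b(?:{})\b'.format('|'.join(sorted(terms, key=len, reverse=True)))

# One list of matches per text.
matches = [re.findall(pat, text) for text in texts]

# For comparison: the same alternation built without sorting.
naive_pat = r'\b(?:{})\b'.format('|'.join(terms))
naive = re.findall(naive_pat, 'foo baz qux')

print(pat)      # \b(?:foo baz|foo|baz)\b
print(matches)  # [['foo baz'], ['foo'], [], ['baz', 'foo']]
print(naive)    # ['foo', 'baz']
```

With the unsorted alternation, foo wins at the first position and foo baz is never reported as a single match. If the keywords may contain regex metacharacters, passing each one through re.escape before joining would be the safer choice.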
See more on this in "Remember That The Regex Engine Is Eager".