I have a data frame in which I need to find all rows that match any of a list of terms. My code is
texts = ['foo abc', 'foobar xyz',
The longer alternatives should come before the shorter ones, so you need to sort the keywords by length in descending order:
pat = r'\b(?:{})\b'.format('|'.join(sorted(terms,key=len,reverse=True)))
The result will be the \b(?:foo baz|foo|baz)\b pattern. It will first try to match foo baz, then foo, then baz. If foo baz is found, the match is returned, and the search for the next match resumes from the end of that match, so a foo or baz that is part of an already matched foo baz will not be matched again.
See more on this in "Remember That The Regex Engine Is Eager".
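A minimal sketch of how the ordering changes the result, assuming the terms are foo, baz and foo baz (inferred from the pattern above). The sample text, the build_pattern helper and the re.escape call are additions for illustration, not part of the original code:

```python
import re

terms = ['foo', 'baz', 'foo baz']  # assumed term list, inferred from the pattern above
text = 'foo baz and foo'           # hypothetical sample text

def build_pattern(terms, longest_first):
    # Hypothetical helper: optionally sort longest first, then join into one alternation.
    # re.escape guards against terms that contain regex metacharacters.
    ordered = sorted(terms, key=len, reverse=True) if longest_first else terms
    return r'\b(?:{})\b'.format('|'.join(re.escape(t) for t in ordered))

unsorted_matches = re.findall(build_pattern(terms, False), text)
sorted_matches = re.findall(build_pattern(terms, True), text)

print(unsorted_matches)  # ['foo', 'baz', 'foo']
print(sorted_matches)    # ['foo baz', 'foo']
```

With the unsorted alternation the engine accepts foo as soon as it matches, so foo baz is never reported as a single term.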
Instead of using a regex pattern to check for the presence of terms,
# create pattern
p = re.compile(pat)
# search for pattern in the column
results = [p.findall(text) for text in df.Match_text.tolist()]
try a simple substring lookup of each term in the text, like this:
#search for each term in the column
results = [[term for term in terms if term in text] for text in df.Match_text.tolist()]
Output for the above looks like this:
   Match_text              results
0     foo abc                [foo]
3      baz 45                [baz]
6     foo baz  [foo, baz, foo baz]
NOTE: This method tests every term against every text, so the running time grows with the number of rows multiplied by the number of terms.
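One thing to keep in mind is that the in operator matches substrings, so foo is also found inside foobar. A possible variant, sketched here under the assumption that whole-word matches are wanted, keeps the per-term lookup but uses one precompiled, word-bounded pattern per term. The term list is assumed, the sample texts are taken from the rows shown above, and find_terms is a hypothetical helper name:

```python
import re

terms = ['foo', 'baz', 'foo baz']  # assumed term list
texts = ['foo abc', 'foobar xyz', 'baz 45', 'foo baz']  # sample texts from the rows above

# Compile one word-bounded pattern per term, once, outside the loop
compiled = {t: re.compile(r'\b{}\b'.format(re.escape(t))) for t in terms}

def find_terms(text):
    # Hypothetical helper: return every term that occurs in text as a whole word or phrase
    return [t for t, p in compiled.items() if p.search(text)]

# Plain substring lookup, as in the answer above
substring_results = [[t for t in terms if t in text] for text in texts]
# Word-bounded lookup; overlapping terms are still all reported
bounded_results = [find_terms(text) for text in texts]

print(substring_results)  # [['foo'], ['foo'], ['baz'], ['foo', 'baz', 'foo baz']]
print(bounded_results)    # [['foo'], [], ['baz'], ['foo', 'baz', 'foo baz']]
```

Both versions still check every term against every text, so the cost noted above is unchanged; only the foobar row differs.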