How do I check if a text column in my dataframe contains any of a list of possible patterns, allowing for typos?

Question

I have a column called 'text' in my dataframe that contains a lot of free-form text. I want to check whether this column contains any of the strings from a list of patterns (e.g. pattern1, pattern2, pattern3), and create a new boolean column stating whether any of those patterns was found.

Importantly, the match should tolerate small typos. For example, if my list of patterns contains 'mickey' and 'mouse', I want it to match 'm0use' and 'muckey' too, not only the exact pattern strings.

I tried this, using the regex library:

import regex
list_of_patterns = ['pattern1','pattern2','pattern3','pattern4']
df['contains_any_pattern'] = df['text'].apply(lambda x: regex.search(pattern=('^(' + '|'.join(list_of_patterns) + ').${e<=2:[a-zA-Z]}'),string=x,flags=re.IGNORECASE))

I checked the text afterwards and could see that this is not working. Does anyone have a better idea for solving this problem?

Here is a short example:

df = pd.DataFrame({'id':[1,2,3,4,5],
                      'text':['my name is mickey mouse',
                              'my name is donkey kong',
                              'my name is mockey',
                              'my surname is m0use',
                              'hey, its me, mario!'
                             ]})

list_of_patterns = ['mickey','mouse']    
df['contains_pattern'] = df['text'].apply(lambda x: regex.search(pattern=r'(?i)^('+ '|'.join(list_of_patterns) +'){s<=2:[a-zA-Z]}',string=x))

And here is the resulting df:

id                       text      contains_pattern
1     my name is mickey mouse                  None
2      my name is donkey kong                  None
3           my name is mockey                  None
4         my surname is m0use                  None
5         hey, its me, mario!                  None

Answer 1:

You can fix the code by using something like:

df['contains_any_pattern'] = df['text'].apply(lambda x: regex.search(r'(?i)\b(?:' + '|'.join(list_of_patterns) + r'){e<=2}\b', x))

Or, if the search words may contain special characters, escape them first:

pat = r'(?i)(?<!\w)(?:' + '|'.join([regex.escape(p) for p in list_of_patterns]) + r'){e<=2}(?!\w)'
df['contains_any_pattern'] = df['text'].apply(lambda x: regex.search(pat, x))

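With list_of_patterns = ['mickey','mouse'], the pattern now looks like (?i)\b(?:mickey|mouse){e<=2}\b. Here {e<=2} allows at most two errors (insertions, deletions or substitutions) in the group it follows. The attempts in the question anchored the match with ^, so only text that starts with one of the patterns could match, and the first attempt also put .$ between the group and the quantifier. Adjust the pattern as you see fit, but make sure the fuzzy quantifier comes right after the group.

re.IGNORECASE comes from the re package, which the question's code never imports. With the regex library you can simply use the inline modifier, (?i), to enable case-insensitive matching.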
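Note that regex.search returns a match object or None, while the question asks for a boolean column. A minimal sketch of that last step, assuming the sample dataframe from the question and that the third-party regex package is installed (the names fuzzy and alternation are only illustrative):

```python
import pandas as pd
import regex  # third-party package (pip install regex), not the stdlib re

# Sample data copied from the question.
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'text': ['my name is mickey mouse',
             'my name is donkey kong',
             'my name is mockey',
             'my surname is m0use',
             'hey, its me, mario!'],
})
list_of_patterns = ['mickey', 'mouse']

# Escape each search word, then allow at most two errors in the whole group.
alternation = '|'.join(regex.escape(p) for p in list_of_patterns)
fuzzy = regex.compile(r'(?i)(?<!\w)(?:' + alternation + r'){e<=2}(?!\w)')

# search() returns a match object or None; turn that into True/False.
df['contains_pattern'] = df['text'].apply(lambda s: fuzzy.search(s) is not None)

print(df)
```

On this sample, rows 1, 3 and 4 come out True and rows 2 and 5 come out False, because 'donkey' is three substitutions away from 'mickey', one more than {e<=2} allows. Compiling the pattern once outside the lambda avoids rebuilding the pattern string for every row.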

If you need to handle hundreds or thousands of search terms, you may leverage the approach described in Speed up millions of regex replacements in Python 3.

Source: https://stackoverflow.com/questions/59570950/how-do-check-if-a-text-column-in-my-dataframe-contains-a-list-of-possible-patte

Tags

python

regex

python-3.x

dataframe

text-mining