Question
I have a column called 'text' in my dataframe that contains a lot of free-form text. I am trying to check whether this column contains any of the strings from a list of patterns (e.g. pattern1, pattern2, pattern3). I want to create another boolean column stating whether any of those patterns was found.
Importantly, the match should also succeed when there are small typos. For example, if my list of patterns contains 'mickey' and 'mouse', I want it to match 'm0use' and 'muckey' too, not only the exact pattern string.
I tried this, using the regex library:
import regex
list_of_patterns = ['pattern1','pattern2','pattern3','pattern4']
df['contains_any_pattern'] = df['text'].apply(lambda x: regex.search(pattern=('^(' + '|'.join(list_of_patterns) + ').${e<=2:[a-zA-Z]}'),string=x,flags=re.IGNORECASE))
I checked the text afterwards and could see that this is not working. Does anyone have a better idea to solve this problem?
Here is a short example:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'text': ['my name is mickey mouse',
                            'my name is donkey kong',
                            'my name is mockey',
                            'my surname is m0use',
                            'hey, its me, mario!']})
list_of_patterns = ['mickey','mouse']
df['contains_pattern'] = df['text'].apply(lambda x: regex.search(pattern=r'(?i)^('+ '|'.join(list_of_patterns) +'){s<=2:[a-zA-Z]}',string=x))
And here is the resulting df:
id  text                     contains_pattern
1   my name is mickey mouse  None
2   my name is donkey kong   None
3   my name is mockey        None
4   my surname is m0use      None
5   hey, its me, mario!      None
Answer 1:
You can fix the code by using something like
df['contains_any_pattern'] = df['text'].apply(lambda x: regex.search(r'(?i)\b(?:' + '|'.join(list_of_patterns) + r'){e<=2}\b', x))
Or, if the search words may contain special characters, use
pat = r'(?i)(?<!\w)(?:' + '|'.join([regex.escape(p) for p in list_of_patterns]) + r'){e<=2}(?!\w)'
df['contains_any_pattern'] = df['text'].apply(lambda x: regex.search(pat, x))
The pattern will now look like (?i)\b(?:mickey|mouse){e<=2}\b. Adjust as you see fit, but make sure that the quantifier is right after the group.
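Note that regex.search returns a match object or None rather than a boolean. As a minimal sketch of the whole flow on the sample data from the question, assuming the third-party regex and pandas packages are installed, something like the following could be used. The "is not None" check and the precompiled pattern are additions made here to get the boolean column the question asks for; they are not part of the code above.

```python
import pandas as pd
import regex  # third-party package (pip install regex), not the stdlib re

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'text': ['my name is mickey mouse',
             'my name is donkey kong',
             'my name is mockey',
             'my surname is m0use',
             'hey, its me, mario!'],
})
list_of_patterns = ['mickey', 'mouse']

# Escape each term, then allow up to 2 errors (insertions, deletions or
# substitutions) for the alternation group as a whole.
pat = regex.compile(
    r'(?i)\b(?:'
    + '|'.join(regex.escape(p) for p in list_of_patterns)
    + r'){e<=2}\b'
)

# Convert "match object or None" into True/False for the new column.
df['contains_pattern'] = df['text'].apply(lambda x: pat.search(x) is not None)
print(df)
```

With an error budget of 2, 'mockey' and 'm0use' are each one substitution away from a search term, while 'donkey' is three substitutions away from 'mickey', so a larger budget would start to produce false positives on this data.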
re.IGNORECASE comes from the re package; with the regex library you can simply use the inline modifier, (?i), to enable case-insensitive matching.
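To illustrate the point about case-insensitive matching, the sketch below (an illustration, not part of the original answer) runs the same fuzzy pattern once with the inline modifier and once with the flag that the regex package itself defines, regex.IGNORECASE. The sample string is made up for the example. Either form avoids depending on constants from re.

```python
import regex  # third-party package, not the stdlib re

text = 'My surname is M0USE'  # hypothetical input with upper-case letters

# Inline modifier: case-insensitivity is part of the pattern string itself.
m_inline = regex.search(r'(?i)\b(?:mickey|mouse){e<=2}\b', text)

# Same pattern without (?i), using the regex package's own flag instead.
m_flag = regex.search(r'\b(?:mickey|mouse){e<=2}\b', text,
                      flags=regex.IGNORECASE)

print(m_inline is not None, m_flag is not None)
```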
If you need to handle hundreds or thousands of search terms, you may leverage the approach described in "Speed up millions of regex replacements in Python 3".
Source: https://stackoverflow.com/questions/59570950/how-do-check-if-a-text-column-in-my-dataframe-contains-a-list-of-possible-patte