I need to filter rows in a pandas DataFrame so that a specific string column contains at least one of a list of provided substrings. The substrings may have unusual or regex-special characters, so they should be matched literally (and case-insensitively).
I want to find all elements of a pd.Series, v, that contain "at" or "Og", and get 1 if the element contains the pattern or 0 if it doesn't.
First, import re:
import re
My vector:
v=pd.Series(['cAt','dog','the rat','mouse','froG'])
[Out]:
0 cAt
1 dog
2 the rat
3 mouse
4 froG
To find all elements of v that contain "at" or "Og", I can define my pattern as:
pattern='at|Og'
Since I want a vector with 1s if the item contains the pattern and 0s if it doesn't, I create a vector of ones with the same length as v:
v_binary = [1] * len(v)
Then I obtain a boolean Series, s, that is True where an element of v contains the pattern and False where it doesn't:
s=v.str.contains(pattern, flags=re.IGNORECASE, regex=True)
To obtain the binary vector, I multiply v_binary by s:
v_binary*s
[Out]
0 1
1 1
2 1
3 0
4 1
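As a shorter equivalent, the boolean Series can also be cast to integers directly, without creating v_binary at all:
s.astype(int)
[Out]
0 1
1 1
2 1
3 0
4 1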
You could try using the Aho-Corasick algorithm. In the average case it is O(n + m + p), where n is the total length of the search strings, m is the length of the searched text, and p is the number of output matches.
The Aho-Corasick algorithm is often used to find multiple patterns (needles) in an input text (the haystack).
pyahocorasick is a Python wrapper around a C implementation of the algorithm.
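Here is a minimal sketch of the library's basic usage (following the Automaton interface from pyahocorasick's documentation), separate from the benchmark below:
import ahocorasick

# build the automaton over the needles
A = ahocorasick.Automaton()
for idx, word in enumerate(['at', 'og']):
    A.add_word(word, (idx, word))
A.make_automaton()

# iter() scans the haystack once and yields (end_index, value) for every needle found
for end_index, (idx, word) in A.iter('the cat and the dog'):
    print(end_index, word)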
Let's compare how fast it is versus some alternatives. Below is a benchmark showing using_ahocorasick to be over 30x faster than the original method (shown in the question) on a 50K-row DataFrame test case:
|                   | speed factor     | ms per loop |
|                   | compared to orig |             |
|-------------------+------------------+-------------|
| using_ahocorasick | 30.7x            |         140 |
| using_regex       | 2.7x             |        1580 |
| orig              | 1.0x             |        4300 |
In [89]: %timeit using_ahocorasick(col, lst)
10 loops, best of 3: 140 ms per loop
In [88]: %timeit using_regex(col, lst)
1 loop, best of 3: 1.58 s per loop
In [91]: %timeit orig(col, lst)
1 loop, best of 3: 4.3 s per loop
Here is the setup used for the benchmark. It also verifies that the output matches the result returned by orig:
import numpy as np
import random
import pandas as pd
import ahocorasick
import re

random.seed(321)

def orig(col, lst):
    # one pass over the column per substring, OR-ed together
    mask = np.logical_or.reduce([col.str.contains(i, regex=False, case=False)
                                 for i in lst])
    return mask

def using_regex(col, lst):
    """https://stackoverflow.com/a/48590850/190597 (Alex Riley)"""
    esc_lst = [re.escape(s) for s in lst]
    pattern = '|'.join(esc_lst)
    mask = col.str.contains(pattern, case=False)
    return mask

def using_ahocorasick(col, lst):
    A = ahocorasick.Automaton(ahocorasick.STORE_INTS)
    for word in lst:
        A.add_word(word.lower())
    A.make_automaton()
    col = col.str.lower()
    # a row matches if the automaton finds at least one needle in it
    mask = col.apply(lambda x: bool(list(A.iter(x))))
    return mask

N = 50000

# 100 substrings of 5 characters
lst = [''.join([chr(random.randint(0, 255)) for _ in range(5)]) for _ in range(100)]

# N strings of 20 characters
strings = [''.join([chr(random.randint(0, 255)) for _ in range(20)]) for _ in range(N)]

# make about 10% of the strings match a string from lst; this helps check that our method works
strings = [s + random.choice(lst) if random.randint(0, 99) < 10 else s
           for s in strings]

col = pd.Series(strings)

expected = orig(col, lst)
for name, result in [('using_regex', using_regex(col, lst)),
                     ('using_ahocorasick', using_ahocorasick(col, lst))]:
    status = 'pass' if np.allclose(expected, result) else 'fail'
    print('{}: {}'.format(name, status))
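Any of these functions returns a boolean mask, so recovering the matching rows afterwards is just an indexing step, for example:
matches = col[using_ahocorasick(col, lst)]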
If you're sticking to pure pandas, then for both performance and practicality I think you should use regex for this task. However, you will need to properly escape any special characters in the substrings first, to ensure that they are matched literally (and not used as regex metacharacters).
This is easy to do using re.escape:
>>> import re
>>> esc_lst = [re.escape(s) for s in lst]
These escaped substrings can then be joined using a regex pipe, |. Each of the substrings is checked against a string until one matches (or they have all been tested):
>>> pattern = '|'.join(esc_lst)
The masking stage then becomes a single low-level loop through the rows:
df[col].str.contains(pattern, case=False)
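To keep only the matching rows, as the question asks, the same mask can be used to index the DataFrame directly:
df[df[col].str.contains(pattern, case=False)]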
Here's a simple setup to get a sense of performance:
import re
import pandas as pd
from random import randint, seed

seed(321)

# 100 substrings of 5 characters
lst = [''.join([chr(randint(0, 255)) for _ in range(5)]) for _ in range(100)]

# 50000 strings of 20 characters
strings = [''.join([chr(randint(0, 255)) for _ in range(20)]) for _ in range(50000)]

col = pd.Series(strings)

esc_lst = [re.escape(s) for s in lst]
pattern = '|'.join(esc_lst)
The proposed method takes about 1 second (so maybe up to 20 seconds for 1 million rows):
%timeit col.str.contains(pattern, case=False)
1 loop, best of 3: 981 ms per loop
The method in the question took approximately 5 seconds using the same input data.
It's worth noting that these times are 'worst case' in the sense that there were no matches (so all substrings were checked). If there are matches, then the timing will improve.