Pandas filtering for multiple substrings in series

说谎 2020-11-22 04:08

I need to filter rows in a pandas dataframe so that a specific string column contains at least one of a list of provided substrings. The substrings may have unu

3 answers

梦毁少年i (original poster)

2020-11-22 04:37
If you're sticking to pure pandas, I think you should use regex for this task, for both performance and practicality. However, you will need to escape any special characters in the substrings first, so that they are matched literally and not interpreted as regex metacharacters.

This is easy to do using re.escape:
```
>>> import re
>>> esc_lst = [re.escape(s) for s in lst]
```
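To see why the escaping step matters, here is a minimal sketch. The substrings in `lst` are made up for illustration; they just happen to contain characters that regex treats specially.

```python
import re

# Hypothetical substrings containing regex metacharacters
lst = ["a.b", "1+1", "(x)"]
esc_lst = [re.escape(s) for s in lst]

# Unescaped, "." matches any character, so "a.b" wrongly matches "axb"
print(bool(re.search("a.b", "axb")))             # True
# Escaped, the "." is matched literally
print(bool(re.search(re.escape("a.b"), "axb")))  # False
print(bool(re.search(re.escape("a.b"), "a.b")))  # True
```

Without escaping, a substring like `(x` would not even compile as a pattern, since the parenthesis is unbalanced.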
The escaped substrings can then be joined with the regex alternation operator |. The regex engine tries each alternative against a string until one matches (or all of them have been tested).
```
>>> pattern = '|'.join(esc_lst)
```
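As a small illustration of the joined pattern (again with made-up substrings), note that escaping has to happen before joining. Otherwise a substring that itself contains `|` would be split into two separate alternatives.

```python
import re

# Hypothetical substrings; the second one contains a literal pipe
lst = ["foo.bar", "x|y"]
pattern = "|".join(re.escape(s) for s in lst)

# The pattern matches a string if any one of the literal substrings occurs in it
print(bool(re.search(pattern, "see foo.bar")))  # True
print(bool(re.search(pattern, "a x|y b")))      # True
print(bool(re.search(pattern, "x or y")))       # False: "x|y" is matched literally
print(bool(re.search(pattern, "fooxbar")))      # False: "." is matched literally
```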
The masking stage then becomes a single low-level loop through the rows:
```
df[col].str.contains(pattern, case=False)
```
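Putting the steps together, a minimal end-to-end sketch might look like this. The DataFrame, the column name `text`, and the substrings are all hypothetical; swap in your own.

```python
import re

import pandas as pd

# Hypothetical data and column name, for illustration only
df = pd.DataFrame({
    "text": [
        "Price is $5.00",
        "no match here",
        "A+B equals C",
        "a+b in lower case",
    ]
})
lst = ["$5", "a+b"]

# Escape each substring, then join them into one alternation pattern
pattern = "|".join(re.escape(s) for s in lst)

# Boolean mask: True where the column contains at least one substring
mask = df["text"].str.contains(pattern, case=False)
filtered = df[mask]

print(mask.tolist())  # [True, False, True, True]
```

Because of `case=False`, the row with `A+B` matches the substring `a+b` as well.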
Here's a simple setup to get a sense of performance:
```
import re
from random import randint, seed

import pandas as pd

seed(321)

# 100 substrings of 5 characters
lst = [''.join([chr(randint(0, 256)) for _ in range(5)]) for _ in range(100)]

# 50000 strings of 20 characters
strings = [''.join([chr(randint(0, 256)) for _ in range(20)]) for _ in range(50000)]

col = pd.Series(strings)
esc_lst = [re.escape(s) for s in lst]
pattern = '|'.join(esc_lst)
```
The proposed method takes about 1 second (so maybe up to 20 seconds for 1 million rows):
```
%timeit col.str.contains(pattern, case=False)
1 loop, best of 3: 981 ms per loop
```
The method in the question took approximately 5 seconds using the same input data.

It's worth noting that these times are 'worst case' in the sense that there were no matches (so every substring was checked against every string). If there are matches, then the timing will improve.