I often need to filter a pandas DataFrame df by df[df['col_name'] == 'string_value'], and I want to speed up the row selection operation. Is there a quick way to do that?
I have long wanted to add binary search indexes to DataFrame objects. You can take the DIY approach of sorting by the column and doing this yourself:
In [11]: df = df.sort_values('STK_ID') # skip this if you're sure it's sorted
In [12]: df['STK_ID'].searchsorted('A0003', side='left')
Out[12]: 6000
In [13]: df['STK_ID'].searchsorted('A0003', side='right')
Out[13]: 8000
In [14]: timeit df[6000:8000]
10000 loops, best of 3: 134 µs per loop
This is fast because slicing by position returns a view of the data rather than a copy, whereas boolean-mask selection has to copy the selected rows.
Depending on what you want to do with the selection afterwards, and if you have to make multiple selections of this kind, the groupby functionality can also make things faster (at least with the example).
Even if you only have to select the rows for one string_value, it is a little bit faster (but not much):
In [11]: %timeit df[df['STK_ID']=='A0003']
1 loops, best of 3: 626 ms per loop
In [12]: %timeit df.groupby("STK_ID").get_group("A0003")
1 loops, best of 3: 459 ms per loop
But subsequent calls to the GroupBy object will be very fast (e.g. to select the rows of other string_values):
In [25]: grouped = df.groupby("STK_ID")
In [26]: %timeit grouped.get_group("A0003")
1 loops, best of 3: 333 us per loop
Somewhat surprisingly, working with the .values array instead of the Series is much faster for me:
>>> time df = mul_df(3000, 2000, 3).reset_index()
CPU times: user 5.96 s, sys: 0.81 s, total: 6.78 s
Wall time: 6.78 s
>>> timeit df[df["STK_ID"] == "A0003"]
1 loops, best of 3: 841 ms per loop
>>> timeit df[df["STK_ID"].values == "A0003"]
1 loops, best of 3: 210 ms per loop