I often need to filter a pandas DataFrame df by df[df['col_name'] == 'string_value'], and I want to speed up the row selection operation. Is there a quick way to do that?
I have long wanted to add binary search indexes to DataFrame objects. You can take the DIY approach of sorting by the column and doing this yourself:
In [11]: df = df.sort_values('STK_ID') # skip this if you're sure it's sorted
In [12]: df['STK_ID'].searchsorted('A0003', side='left')
Out[12]: 6000
In [13]: df['STK_ID'].searchsorted('A0003', side='right')
Out[13]: 8000
In [14]: timeit df[6000:8000]
10000 loops, best of 3: 134 µs per loop
This is fast because slicing by position returns a view of the data rather than a copy, whereas boolean-mask selection has to copy the selected rows.
Depending on what you want to do with the selection afterwards, and if you have to make multiple selections of this kind, the groupby functionality can also make things faster (at least with the example).
Even if you only have to select the rows for one string_value, it is a little bit faster (but not much):
In [11]: %timeit df[df['STK_ID']=='A0003']
1 loops, best of 3: 626 ms per loop
In [12]: %timeit df.groupby("STK_ID").get_group("A0003")
1 loops, best of 3: 459 ms per loop
But subsequent calls to the GroupBy object will be very fast (e.g. to select the rows of other string_values):
In [25]: grouped = df.groupby("STK_ID")
In [26]: %timeit grouped.get_group("A0003")
1 loops, best of 3: 333 us per loop
Somewhat surprisingly, working with the .values array instead of the Series is much faster for me:
>>> time df = mul_df(3000, 2000, 3).reset_index()
CPU times: user 5.96 s, sys: 0.81 s, total: 6.78 s
Wall time: 6.78 s
>>> timeit df[df["STK_ID"] == "A0003"]
1 loops, best of 3: 841 ms per loop
>>> timeit df[df["STK_ID"].values == "A0003"]
1 loops, best of 3: 210 ms per loop