How to speed up pandas row filtering by string matching?

-上瘾入骨i 2021-01-31 22:47

I often need to filter a pandas DataFrame df with df[df['col_name']=='string_value'], and I want to speed up the row selection operation, is there a qui

3 Answers
  • 2021-01-31 23:02

    I have long wanted to add binary search indexes to DataFrame objects. You can take the DIY approach of sorting by the column and doing this yourself:

    In [11]: df = df.sort_values('STK_ID') # skip this if you're sure it's sorted; older pandas spelled this df.sort('STK_ID')
    
    In [12]: df['STK_ID'].searchsorted('A0003', 'left')
    Out[12]: 6000
    
    In [13]: df['STK_ID'].searchsorted('A0003', 'right')
    Out[13]: 8000
    
    In [14]: timeit df[6000:8000]
    10000 loops, best of 3: 134 µs per loop
    

    This is fast because it always retrieves views and does not copy any data.
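
    The sort-then-binary-search idea above can be sketched as a small helper. This is only an illustration under assumptions: the toy data and the name select_sorted are hypothetical, and sort_values is used as the current spelling of the sort step.

```python
import pandas as pd

# Hypothetical toy data; the column name STK_ID follows the session above.
df = pd.DataFrame({
    "STK_ID": ["A0003", "A0001", "A0003", "A0002", "A0001", "A0003"],
    "value": [10, 20, 30, 40, 50, 60],
})

# Sort once by the key column. A stable sort keeps the original row order
# within each key, and the reset index makes positions match labels.
df_sorted = df.sort_values("STK_ID", kind="stable").reset_index(drop=True)


def select_sorted(frame, key, column="STK_ID"):
    """Return the rows of an already-sorted frame whose `column` equals `key`.

    Uses two binary searches to find the block of matching rows, then
    slices by position instead of building a boolean mask.
    """
    lo = int(frame[column].searchsorted(key, side="left"))
    hi = int(frame[column].searchsorted(key, side="right"))
    return frame.iloc[lo:hi]


subset = select_sorted(df_sorted, "A0003")
print(subset)
```

    The sort is paid once, so this mainly helps when many lookups are made against the same frame. A key that is not present gives lo == hi and therefore an empty slice.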

  • 2021-01-31 23:14

    Depending on what you want to do with the selection afterwards, and if you have to make multiple selections of this kind, the groupby functionality can also make things faster (at least with the example).

    Even if you only have to select the rows for one string_value, it is a little bit faster (but not much):

    In [11]: %timeit df[df['STK_ID']=='A0003']
    1 loops, best of 3: 626 ms per loop
    
    In [12]: %timeit df.groupby("STK_ID").get_group("A0003")
    1 loops, best of 3: 459 ms per loop
    

    But subsequent calls on the same GroupBy object will be very fast (e.g. to select the rows for other string_values):

    In [25]: grouped = df.groupby("STK_ID")
    
    In [26]: %timeit grouped.get_group("A0003")
    1 loops, best of 3: 333 us per loop
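
    As a self-contained sketch of that pattern (the toy data here is hypothetical, and no timings are claimed for it), the GroupBy object is built once and then reused for each lookup:

```python
import pandas as pd

# Hypothetical toy data; the column name STK_ID follows the session above.
df = pd.DataFrame({
    "STK_ID": ["A0001", "A0003", "A0002", "A0003"],
    "value": [1, 2, 3, 4],
})

# Build the GroupBy object once and keep a reference to it,
# so the grouping work can be reused across lookups.
grouped = df.groupby("STK_ID")

# Each lookup returns the rows of the original frame for that key.
rows_a3 = grouped.get_group("A0003")
rows_a1 = grouped.get_group("A0001")

print(rows_a3)
print(rows_a1)
```

    Note that get_group raises a KeyError for a key that does not occur in the column, unlike the boolean-mask filter, which returns an empty frame.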
    
  • 2021-01-31 23:24

    Somewhat surprisingly, working with the .values array instead of the Series is much faster for me:

    >>> time df = mul_df(3000, 2000, 3).reset_index()
    CPU times: user 5.96 s, sys: 0.81 s, total: 6.78 s
    Wall time: 6.78 s
    >>> timeit df[df["STK_ID"] == "A0003"]
    1 loops, best of 3: 841 ms per loop
    >>> timeit df[df["STK_ID"].values == "A0003"]
    1 loops, best of 3: 210 ms per loop
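
    A minimal sketch of the two variants on hypothetical toy data, to show that both masks select the same rows. It uses .to_numpy(), which the pandas documentation recommends in place of .values; any speed difference depends on the pandas version and the data, so it is worth timing on your own frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column name STK_ID follows the session above.
df = pd.DataFrame({
    "STK_ID": ["A0001", "A0003", "A0002", "A0003"],
    "value": [1, 2, 3, 4],
})

# Comparison on the Series: the result is a boolean Series with an index.
mask_series = df["STK_ID"] == "A0003"

# Comparison on the underlying NumPy array: the result is a plain
# boolean ndarray, so no Series is constructed for the mask.
mask_array = df["STK_ID"].to_numpy() == "A0003"

# Both masks can be used to index the frame and pick the same rows.
by_series = df[mask_series]
by_array = df[mask_array]
print(by_array)
```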
    