Fast pandas filtering

后悔当初 2021-02-06 13:12

I want to filter a pandas DataFrame, keeping only the rows whose name column value appears in a given list.

Here we have a DataFrame:

x = DataFrame(
    [['sam', 328], [

3 answers
  •  生来不讨喜
    2021-02-06 13:39

    If your data repeats many values, try using the 'categorical' data type for that column and then applying boolean filtering. It is much more flexible than using indices and, at least in my case, much faster.

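    import pandas as pd
    # Load the name column as a categorical dtype, then filter with a boolean mask
    data = pd.read_csv('data.csv', dtype={'name': 'category'})
    data[(data.name == 'sam') & (data.score > 1)]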
    

    or

    names = ['sam', 'ruby']
    data[data.name.isin(names)]
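
Both filters above can be tried without a CSV file. The following is a minimal sketch, assuming a small in-memory DataFrame with made-up names and scores, and using `astype('category')` to convert a column that was already loaded as plain strings (the answer itself sets the dtype at read time instead):

```python
import pandas as pd

# Hypothetical sample data; the names and scores are made up for illustration.
df = pd.DataFrame({
    "name": ["sam", "ruby", "adam", "sam", "ruby"],
    "score": [328, 3213, 3, 1, 0],
})

# Convert an existing string column to the categorical dtype.
df["name"] = df["name"].astype("category")

# Boolean filter combining two conditions.
sam_rows = df[(df["name"] == "sam") & (df["score"] > 1)]

# Membership filter against a list of names.
names = ["sam", "ruby"]
listed_rows = df[df["name"].isin(names)]

print(sam_rows)
print(listed_rows)
```

The filtering syntax is the same as for a string column; only the underlying storage changes.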
    

    For the dataset I'm working with (~15 million rows, ~200k unique terms) in pandas 1.2, %timeit results are:

    • boolean filter on object column: 608 ms
    • .loc filter on same object column as index: 281 ms
    • boolean filter on same object column as 'categorical' type: 16 ms
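
A comparison along these lines can be reproduced roughly as follows. This is only a sketch on synthetic data: the row count, the number of unique names, and the `name_…` labels are assumptions chosen so it runs quickly, and it uses the standard-library `timeit` module rather than the `%timeit` magic. Absolute timings will differ from the figures above and depend on the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (an assumption): far smaller than the dataset described above.
rng = np.random.default_rng(0)
n_rows, n_unique = 200_000, 2_000
labels = np.array([f"name_{i}" for i in range(n_unique)], dtype=object)
names = labels[rng.integers(0, n_unique, size=n_rows)]

# Same data three ways: object column, object index, categorical column.
obj_df = pd.DataFrame({
    "name": pd.Series(names, dtype=object),
    "score": rng.integers(0, 100, size=n_rows),
})
idx_df = obj_df.set_index("name")
cat_df = obj_df.assign(name=obj_df["name"].astype("category"))

target = names[0]  # a value guaranteed to be present

timings = {
    "boolean filter, object column": timeit.timeit(
        lambda: obj_df[obj_df["name"] == target], number=5),
    # The list form of .loc always returns a DataFrame.
    ".loc filter, object index": timeit.timeit(
        lambda: idx_df.loc[[target]], number=5),
    "boolean filter, categorical column": timeit.timeit(
        lambda: cat_df[cat_df["name"] == target], number=5),
}

for label, seconds in timings.items():
    print(f"{label}: {seconds / 5 * 1000:.2f} ms per run")
```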

    From there, add .sum() or whatever aggregation function you need.
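
As an illustration of that last step, here is a small sketch that chains an aggregation onto the filtered rows. The data is made up, and the per-name `groupby` is just one possible aggregation, not something the answer prescribes:

```python
import pandas as pd

# Hypothetical sample data with a categorical name column.
df = pd.DataFrame({
    "name": pd.Categorical(["sam", "ruby", "adam", "sam"]),
    "score": [328, 3213, 3, 1],
})

names = ["sam", "ruby"]
mask = df["name"].isin(names)

# Sum of scores over the filtered rows.
total = df.loc[mask, "score"].sum()

# Per-name sums; observed=True skips categories with no rows after filtering.
per_name = df[mask].groupby("name", observed=True)["score"].sum()

print(total)
print(per_name)
```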
