I want to filter a pandas DataFrame if the name column entry is in a given list.
Here we have a DataFrame:
x = DataFrame([['sam', 328], ['ruby', 3213], ['jon', 121]], columns=['name', 'score'])
If your data repeats a lot of values, try using the 'categorical' data type for that column and then applying boolean filtering. Much more flexible than using indices and, at least in my case, much faster.
data = pd.read_csv('data.csv', dtype={'name':'category'})
data[(data.name=='sam')&(data.score>1)]
or
names=['sam','ruby']
data[data.name.isin(names)]
For the ~15 million row, ~200k unique terms dataset I'm working with in pandas 1.2, %timeit showed the categorical boolean filtering to be clearly faster than the index-based alternatives. From there, add the .sum() or whatever aggregation function you're looking for.
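Putting those pieces together, here is a minimal sketch of the whole pipeline; data.csv is the same hypothetical file as in the snippet above, and the name/score columns are assumptions, not from the original question:

import pandas as pd

# Hypothetical CSV with 'name' and 'score' columns; read 'name' as categorical
data = pd.read_csv('data.csv', dtype={'name': 'category'})
names = ['sam', 'ruby']

# Build a boolean mask on the categorical column, then aggregate the matches
total = data.loc[data['name'].isin(names), 'score'].sum()
print(total)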
Try using isin (thanks to DSM for suggesting loc over ix here):
In [78]: x = pd.DataFrame([['sam',328],['ruby',3213],['jon',121]], columns = ['name', 'score'])
In [79]: names = ['sam', 'ruby']
In [80]: x['name'].isin(names)
Out[80]:
0     True
1     True
2    False
Name: name, dtype: bool
In [81]: x.loc[x['name'].isin(names), 'score'].sum()
Out[81]: 3541
CT Zhu suggests a faster alternative using np.in1d:
In [105]: y = pd.concat([x]*1000)
In [109]: %timeit y.loc[y['name'].isin(names), 'score'].sum()
1000 loops, best of 3: 413 µs per loop
In [110]: %timeit y.loc[np.in1d(y['name'], names), 'score'].sum()
1000 loops, best of 3: 335 µs per loop
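A side note, not from the original answers: recent NumPy versions recommend np.isin for the same element-wise membership test, and np.in1d is deprecated as of NumPy 2.0. A sketch continuing the session above, with the same y and names:

import numpy as np

# Equivalent to the np.in1d call, using the newer np.isin API
y.loc[np.isin(y['name'], names), 'score'].sum()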
If I need to search on a field, I have noticed that it helps immensely if I change the index of the DataFrame to the search field. For one of my search and lookup requirements I got a performance improvement of around 500%.
So in your case the following could be used to search and filter by name.
df = pd.DataFrame([['sam', 328], ['ruby', 3213], ['jon', 121]],
                  columns=['name', 'score'])
names = ['sam', 'ruby']
df_searchable = df.set_index('name')
df_searchable[df_searchable.index.isin(names)]
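If every name in the list is known to exist in the index, a direct label lookup is another option. This is only a sketch building on the answer above, not part of it: .loc with a list of labels raises a KeyError when any label is missing (pandas >= 1.0), so index.isin is the safer filter, and sorting the index speeds up repeated lookups.

import pandas as pd

df = pd.DataFrame([['sam', 328], ['ruby', 3213], ['jon', 121]],
                  columns=['name', 'score'])

# Sorted 'name' index allows fast, repeated label lookups
df_searchable = df.set_index('name').sort_index()

# Direct lookup; raises KeyError if any requested name is absent
df_searchable.loc[['sam', 'ruby']]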