There are many questions here with similar titles, but I couldn't find one that addresses this issue.
I have dataframes from many different origins, and I want to filter one by the other: I have a boolean Series indexed by one level of a DataFrame's MultiIndex, and I want to select the rows of that DataFrame whose value on that level maps to True.
You can use pd.IndexSlice.
>>> df.loc[pd.IndexSlice[filt[filt].index.values, :], :]
     c
a b
1 1  0
  2  1
  3  2
3 1  6
  2  7
  3  8
where filt[filt].index.values is just [1, 3]. In other words:
>>> df.loc[pd.IndexSlice[[1, 3], :]]
     c
a b
1 1  0
  2  1
  3  2
3 1  6
  2  7
  3  8
so if you design your filter construction a bit differently, the expression gets shorter. The advantage over Emanuele Paolini's solution df[filt[df.index.get_level_values('a')].values] is that you have more control over the indexing.
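For example, you can restrict both index levels in a single expression. A minimal sketch, using the df and filt from the full code below (the [1, 2] slice on level 'b' is just for illustration):
# filter level 'a' by the boolean series and, at the same time,
# keep only the rows where level 'b' is 1 or 2
df.loc[pd.IndexSlice[filt[filt].index.values, [1, 2]], :]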
The topic of multiindex slicing is covered in more depth here.
Here is the full code:
import pandas as pd
import numpy as np
# sample frame: two-level index ('a', 'b') and a single data column 'c'
df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'b': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'c': range(9)}).set_index(['a', 'b'])
# boolean filter indexed by the values of level 'a'
filt = pd.Series({1: True, 2: False, 3: True})
# three equivalent ways to select the rows where filt is True
print(df.loc[pd.IndexSlice[[1, 3], :]])
print(df.loc[(df.index.levels[0].values[filt], slice(None)), :])
print(df.loc[pd.IndexSlice[filt[filt].index.values, :], :])
If the boolean series is not aligned with the dataframe you want to index, you can first explicitly align it with align:
In [25]: df_aligned, filt_aligned = df.align(filt.to_frame(), level=0, axis=0)
In [26]: filt_aligned
Out[26]:
         0
a b
1 1   True
  2   True
  3   True
2 1  False
  2  False
  3  False
3 1   True
  2   True
  3   True
And then you can index with it:
In [27]: df[filt_aligned[0]]
Out[27]:
     c
a b
1 1  0
  2  1
  3  2
3 1  6
  2  7
  3  8
Note: align didn't work with a plain Series, hence the to_frame in the align call, and hence the [0] above to get back the series.
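Alternatively, the same broadcasting can be had without the to_frame round trip by reindexing the series along one index level, as in the reindex-based answer further down; a minimal sketch:
# broadcast filt along level 'a' of df's MultiIndex, then mask with it
df[filt.reindex(df.index, level='a')]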
I was facing exactly the same problem. I found this question and tried the solutions here, but none of them was efficient enough. My dataframes are: A = 700k rows x 14 cols and B = 100M rows x 3 cols. B has a MultiIndex, where the first (high) level is equal to the index of A. Let C be a slice from A of size 10k rows. My task was to get the rows from B whose high-level index matches the indexes of C, as fast as possible. C is selected at runtime; A and B are static.
I tried the solutions from here: get_level_values takes many seconds, and df.align didn't even finish, giving a MemoryError (and it also took seconds before failing).
The solution that worked for me (in ~300 ms at runtime) is the following:
For each index value i from A, find the first and the last (non-inclusive) positional indexes in B that contain i as the first level of the MultiIndex, and store these pairs in A. This is done once and in advance.
Example code:
from collections import defaultdict
import pandas as pd

def construct_position_indexes(A, B):
    # map each level-0 value of B to its [start, end) positional range
    indexes = defaultdict(list)
    prev_index = 0
    for i, cur_index in enumerate(B.index.get_level_values(0)):
        if cur_index != prev_index:
            indexes[cur_index].append(i)       # start of the new group
            if prev_index:
                indexes[prev_index].append(i)  # end of the previous group
            prev_index = cur_index
    indexes[cur_index].append(i + 1)           # close the last group
    index_df = pd.DataFrame(list(indexes.values()),
                            index=list(indexes.keys()),
                            columns=['start_index', 'end_index'], dtype=int)
    A = A.join(index_df)
    # the joined columns become floats (NaN where a key is missing), so fix that
    A['start_index'] = A.start_index.fillna(0).astype(int)
    A['end_index'] = A.end_index.fillna(0).astype(int)
    return A
At runtime, get the positional boundaries from C, construct a list of all positional indexes to search for in B, and pass them to B.take():
def get_slice(B, C):
    # expand every [start, end) pair from C into explicit positions in B
    all_indexes = []
    for start_index, end_index in zip(
            C.start_index.values, C.end_index.values):
        all_indexes.extend(range(start_index, end_index))
    return B.take(all_indexes)
I hope it's not too complicated. Essentially, the idea is, for each row in A, to store the range of corresponding (positional) indexes of rows in B, so that at runtime we can quickly construct the list of all positional indexes to query B by.
This is a toy example:
A = pd.DataFrame(range(3), columns=['dataA'], index=['A0', 'A1', 'A2'])
print(A)

    dataA
A0      0
A1      1
A2      2

mindex = pd.MultiIndex.from_tuples([
    ('A0', 'B0'), ('A0', 'B1'), ('A1', 'B0'),
    ('A2', 'B0'), ('A2', 'B1'), ('A2', 'B3')])
B = pd.DataFrame(range(6), columns=['dataB'], index=mindex)
print(B)

       dataB
A0 B0      0
   B1      1
A1 B0      2
A2 B0      3
   B1      4
   B3      5

A = construct_position_indexes(A, B)
print(A)

    dataA  start_index  end_index
A0      0            0          2
A1      1            2          3
A2      2            3          6

C = A.iloc[[0, 2], :]
print(C)

    dataA  start_index  end_index
A0      0            0          2
A2      2            3          6

print(get_slice(B, C))

       dataB
A0 B0      0
   B1      1
A2 B0      3
   B1      4
   B3      5
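If B's first index level is sorted, as in this example, the same start/end boundaries could presumably be computed without the Python loop, using numpy's searchsorted; a sketch under that sortedness assumption:
import numpy as np
# vectorized variant of construct_position_indexes, assuming the
# level-0 values of B are sorted
lev0 = B.index.get_level_values(0).values
A['start_index'] = np.searchsorted(lev0, A.index.values, side='left')
A['end_index'] = np.searchsorted(lev0, A.index.values, side='right')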
A more readable (to my liking) solution is to reindex the boolean series (or dataframe) to match the index of the multi-index df:
df.loc[filt.reindex(df.index, level='a')]
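For the sample data above, the reindexed series broadcasts each boolean along level 'a', so the mask lines up row for row with df; something like:
>>> filt.reindex(df.index, level='a')
a  b
1  1     True
   2     True
   3     True
2  1    False
   2    False
   3    False
3  1     True
   2     True
   3     True
dtype: bool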
If you transform your index 'a' back to a column, you can do it as follows:
>>> df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3, 3, 3],
...                    'b': [1, 2, 3, 1, 2, 3, 1, 2, 3],
...                    'c': range(9)})
>>> filt = pd.Series({1: True, 2: False, 3: True})
>>> df[filt[df['a']].values]
   a  b  c
0  1  1  0
1  1  2  1
2  1  3  2
6  3  1  6
7  3  2  7
8  3  3  8
Edit: as suggested by @joris, this also works with indexes. Here is the code for your sample data:
>>> df[filt[df.index.get_level_values('a')].values]
     c
a b
1 1  0
  2  1
  3  2
3 1  6
  2  7
  3  8
Simply:
df.where(
    filt.rename_axis('a').rename('c').to_frame()
).dropna().astype(int)
Explanation:
- .rename_axis('a') renames the index as a (the index we want to filter by)
- .rename('c') renames the column as c (the column that stores the values)
- .to_frame() converts this Series into a DataFrame, for compatibility with df
- df.where(...) filters the rows, leaving missing values (NaN) where the filter is False
- .dropna() removes the rows with missing values (in our case, where a == 2)
- .astype(int) converts from float back to int (not sure why it becomes float to begin with)
By the way, it seems that df.where(...) and df[...] behave similarly here, so take your pick.