There are many questions here with similar titles, but I couldn't find one that addresses this issue.
I have dataframes from many different origins, and I want to filter one by the other: I have a boolean Series indexed by one level of a DataFrame's MultiIndex, and I want to select the rows of that DataFrame whose value on that level maps to True.
You can use pd.IndexSlice.
>>> df.loc[pd.IndexSlice[filt[filt].index.values, :], :]
     c
a b
1 1  0
  2  1
  3  2
3 1  6
  2  7
  3  8
where filt[filt].index.values is just [1, 3]. In other words:
>>> df.loc[pd.IndexSlice[[1, 3], :]]
     c
a b
1 1  0
  2  1
  3  2
3 1  6
  2  7
  3  8
so if you design your filter construction a bit differently, the expression gets shorter. The advantage over Emanuele Paolini's solution df[filt[df.index.get_level_values('a')].values] is that you have more control over the indexing.
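For example, you can restrict both index levels in a single expression. A minimal sketch, using the df and filt from the full code below (the [1, 2] slice on level 'b' is just for illustration):
# filter level 'a' by the boolean series and, at the same time,
# keep only the rows where level 'b' is 1 or 2
df.loc[pd.IndexSlice[filt[filt].index.values, [1, 2]], :]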
The topic of multiindex slicing is covered in more depth here.
Here is the full code:
import pandas as pd
import numpy as np
# sample frame: two-level index ('a', 'b') and a single data column 'c'
df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'b': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'c': range(9)}).set_index(['a', 'b'])
# boolean filter indexed by the values of level 'a'
filt = pd.Series({1: True, 2: False, 3: True})
# three equivalent ways to select the rows where filt is True
print(df.loc[pd.IndexSlice[[1, 3], :]])
print(df.loc[(df.index.levels[0].values[filt], slice(None)), :])
print(df.loc[pd.IndexSlice[filt[filt].index.values, :], :])
If the boolean series is not aligned with the dataframe you want to index, you can first explicitly align it with align:
In [25]: df_aligned, filt_aligned = df.align(filt.to_frame(), level=0, axis=0)
In [26]: filt_aligned
Out[26]:
         0
a b
1 1   True
  2   True
  3   True
2 1  False
  2  False
  3  False
3 1   True
  2   True
  3   True
And then you can index with it:
In [27]: df[filt_aligned[0]]
Out[27]:
     c
a b
1 1  0
  2  1
  3  2
3 1  6
  2  7
  3  8
Note: align didn't work with a plain Series, hence the to_frame in the align call, and hence the [0] above to get back the series.
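Alternatively, the same broadcasting can be had without the to_frame round trip by reindexing the series along one index level, as in the reindex-based answer further down; a minimal sketch:
# broadcast filt along level 'a' of df's MultiIndex, then mask with it
df[filt.reindex(df.index, level='a')]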
I was facing exactly the same problem. I found this question and tried the solutions here, but none of them was efficient enough. My dataframes are: A = 700k rows x 14 cols and B = 100M rows x 3 cols. B has a MultiIndex, where the first (high) level is equal to the index of A. Let C be a slice from A of size 10k rows. My task was to get the rows from B whose high-level index matches the indexes of C, as fast as possible. C is selected at runtime; A and B are static.
I tried the solutions from here: get_level_values takes many seconds, and df.align didn't even finish, giving a MemoryError (and it also took seconds before failing).
The solution that worked for me (in ~300 ms at runtime) is the following:
For each index value i from A, find the first and the last (non-inclusive) positional indexes in B that contain i as the first level of the MultiIndex, and store these pairs in A. This is done once and in advance.
Example code:
from collections import defaultdict
import pandas as pd

def construct_position_indexes(A, B):
    # map each level-0 value of B to its [start, end) positional range
    indexes = defaultdict(list)
    prev_index = 0
    for i, cur_index in enumerate(B.index.get_level_values(0)):
        if cur_index != prev_index:
            indexes[cur_index].append(i)       # start of the new group
            if prev_index:
                indexes[prev_index].append(i)  # end of the previous group
            prev_index = cur_index
    indexes[cur_index].append(i + 1)           # close the last group
    index_df = pd.DataFrame(list(indexes.values()),
                            index=list(indexes.keys()),
                            columns=['start_index', 'end_index'], dtype=int)
    A = A.join(index_df)
    # the joined columns become floats (NaN where a key is missing), so fix that
    A['start_index'] = A.start_index.fillna(0).astype(int)
    A['end_index'] = A.end_index.fillna(0).astype(int)
    return A
At runtime, get the positional boundaries from C, construct a list of all positional indexes to search for in B, and pass them to B.take():
def get_slice(B, C):
    # expand every [start, end) pair from C into explicit positions in B
    all_indexes = []
    for start_index, end_index in zip(
            C.start_index.values, C.end_index.values):
        all_indexes.extend(range(start_index, end_index))
    return B.take(all_indexes)
I hope it's not too complicated. Essentially, the idea is, for each row in A, to store the range of corresponding (positional) indexes of rows in B, so that at runtime we can quickly construct the list of all positional indexes to query B by.
This is a toy example:
A = pd.DataFrame(range(3), columns=['dataA'], index=['A0', 'A1', 'A2'])
print(A)

    dataA
A0      0
A1      1
A2      2

mindex = pd.MultiIndex.from_tuples([
    ('A0', 'B0'), ('A0', 'B1'), ('A1', 'B0'),
    ('A2', 'B0'), ('A2', 'B1'), ('A2', 'B3')])
B = pd.DataFrame(range(6), columns=['dataB'], index=mindex)
print(B)

       dataB
A0 B0      0
   B1      1
A1 B0      2
A2 B0      3
   B1      4
   B3      5

A = construct_position_indexes(A, B)
print(A)

    dataA  start_index  end_index
A0      0            0          2
A1      1            2          3
A2      2            3          6

C = A.iloc[[0, 2], :]
print(C)

    dataA  start_index  end_index
A0      0            0          2
A2      2            3          6

print(get_slice(B, C))

       dataB
A0 B0      0
   B1      1
A2 B0      3
   B1      4
   B3      5
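If B's first index level is sorted, as in this example, the same start/end boundaries could presumably be computed without the Python loop, using numpy's searchsorted; a sketch under that sortedness assumption:
import numpy as np
# vectorized variant of construct_position_indexes, assuming the
# level-0 values of B are sorted
lev0 = B.index.get_level_values(0).values
A['start_index'] = np.searchsorted(lev0, A.index.values, side='left')
A['end_index'] = np.searchsorted(lev0, A.index.values, side='right')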
A more readable (to my liking) solution is to reindex the boolean series (or dataframe) to match the index of the multi-index df:
df.loc[filt.reindex(df.index, level='a')]
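For the sample data above, the reindexed series broadcasts each boolean along level 'a', so the mask lines up row for row with df; something like:
>>> filt.reindex(df.index, level='a')
a  b
1  1     True
   2     True
   3     True
2  1    False
   2    False
   3    False
3  1     True
   2     True
   3     True
dtype: bool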
If you transform your index 'a' back to a column, you can do it as follows:
>>> df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3, 3, 3],
...                    'b': [1, 2, 3, 1, 2, 3, 1, 2, 3],
...                    'c': range(9)})
>>> filt = pd.Series({1: True, 2: False, 3: True})
>>> df[filt[df['a']].values]
   a  b  c
0  1  1  0
1  1  2  1
2  1  3  2
6  3  1  6
7  3  2  7
8  3  3  8
Edit: as suggested by @joris, this also works with indexes. Here is the code for your sample data:
>>> df[filt[df.index.get_level_values('a')].values]
     c
a b
1 1  0
  2  1
  3  2
3 1  6
  2  7
  3  8
Simply:
df.where(
    filt.rename_axis('a').rename('c').to_frame()
).dropna().astype(int)
Explanation:
- .rename_axis('a') renames the index as a (the index we want to filter by)
- .rename('c') renames the column as c (the column that stores the values)
- .to_frame() converts this Series into a DataFrame, for compatibility with df
- df.where(...) filters the rows, leaving missing values (NaN) where the filter is False
- .dropna() removes the rows with missing values (in our case, where a == 2)
- .astype(int) converts from float back to int (not sure why it becomes float to begin with)
By the way, it seems that df.where(...) and df[...] behave similarly here, so take your pick.