pandas: Boolean indexing with multi index

前端 未结 6 1653
再見小時候
再見小時候 2021-02-08 14:52

There are many questions here with similar titles, but I couldn\'t find one that\'s addressing this issue.

I have dataframes from many different origins, and I want to f

相关标签:
6条回答
  • 2021-02-08 15:13

    You can use pd.IndexSlicer.

    >>> df.loc[pd.IndexSlice[filt[filt].index.values, :], :]
         c
    a b   
    1 1  0
      2  1
      3  2
    3 1  6
      2  7
      3  8
    

    where filt[filt].index.values is just [1, 3]. In other words

    >>> df.loc[pd.IndexSlice[[1, 3], :]]
         c
    a b   
    1 1  0
      2  1
      3  2
    3 1  6
      2  7
      3  8
    

    so if you design your filter construction a bit differently, the expression gets shorter. The advantave over Emanuele Paolini's solution df[filt[df.index.get_level_values('a')].values] is, that you have more control over the indexing.

    The topic of multiindex slicing is covered in more depth here.

    Here the full code

    import pandas as pd
    import numpy as np
    df = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3], 'b':[1,2,3,1,2,3,1,2,3], 'c':range(9)}).set_index(['a', 'b'])
    filt = pd.Series({1:True, 2:False, 3:True})
    
    print(df.loc[pd.IndexSlice[[1, 3], :]])
    print(df.loc[(df.index.levels[0].values[filt], slice(None)), :])
    print(df.loc[pd.IndexSlice[filt[filt].index.values, :], :])
    
    0 讨论(0)
  • 2021-02-08 15:14

    If the boolean series is not aligned with the dataframe you want to index it with, you can first explicitely align it with align:

    In [25]: df_aligned, filt_aligned = df.align(filt.to_frame(), level=0, axis=0)
    
    In [26]: filt_aligned
    Out[26]:
             0
    a b
    1 1   True
      2   True
      3   True
    2 1  False
      2  False
      3  False
    3 1   True
      2   True
      3   True
    

    And then you can index with it:

    In [27]: df[filt_aligned[0]]
    Out[27]:
         c
    a b
    1 1  0
      2  1
      3  2
    3 1  6
      2  7
      3  8
    

    Note: the align didn't work with a Series, therefore the to_frame in the align call, and therefore the [0] above to get back the series.

    0 讨论(0)
  • 2021-02-08 15:19

    I was facing exactly the same problem. I found this question and tried the solutions here, but none of them was efficient enough. My dataframes are: A = 700k rows x 14 cols, B = 100M rows x 3 cols. B has an MultiIndex, where the first (high) level is equal to the index of A. Let C be a slice from A of size 10k rows. My task was to get rows from B whose high-level index matches indexes of C as fast as possible. C is selected at runtime. A and B are static.

    I tried the solutions from here: get_level_values takes many seconds, df.align didn't even finish giving MemoryError (and also took seconds).

    The solution which worked for me (in ~300msec during runtime) is the following:

    1. For each index value i from A, find the first and the last (non-inclusive) positional indexes in B which contain i as the first level of MultiIndex. Store these pairs in A. This is done once and in advance. Example code:

      def construct_position_indexes(A, B):
          indexes = defaultdict(list)
          prev_index = 0
          for i, cur_index in enumerate(B.index.get_level_values(0)):
              if cur_index != prev_index:
                  indexes[cur_index].append(i)
                  if prev_index:
                      indexes[prev_index].append(i)
              prev_index = cur_index
          indexes[cur_index].append(i+1)
          index_df = pd.DataFrame(indexes.values(),
                                  index=indexes.keys(),
                                  columns=['start_index', 'end_index'], dtype=int)
          A = A.join(index_df)
          # they become floats, so we fix that
          A['start_index'] = A.start_index.fillna(0).astype(int)
          A['end_index'] = A.end_index.fillna(0).astype(int)
          return A
      
    2. At runtime, get positional boundaries from C and construct a list of all positional indexes to search for in B, and pass them to B.take():

      def get_slice(B, C):
          all_indexes = []
          for start_index, end_index in zip(
                  C.start_index.values, C.end_index.values):
              all_indexes.extend(range(start_index, end_index))
          return B.take(all_indexes)
      

    I hope it's not too complicated. Essentially, the idea is for each row in A store the range of corresponding (positional) indexes of rows in B, so that at runtime we can quickly construct the list of all positional indexes to query B by.

    This is a toy example:

    A = pd.DataFrame(range(3), columns=['dataA'], index=['A0', 'A1', 'A2'])
    print A
    
        dataA
    A0      0
    A1      1
    A2      2
    
    mindex = pd.MultiIndex.from_tuples([
        ('A0', 'B0'), ('A0', 'B1'), ('A1', 'B0'), 
        ('A2', 'B0'), ('A2', 'B1'), ('A2', 'B3')])
    B = pd.DataFrame(range(6), columns=['dataB'], index=mindex)
    print B
    
           dataB
    A0 B0      0
       B1      1
    A1 B0      2
    A2 B0      3
       B1      4
       B3      5
    
    A = construct_position_indexes(A, B)
    print A
    
        dataA  start_index  end_index
    A0      0            0          2
    A1      1            2          3
    A2      2            3          6
    
    C = A.iloc[[0, 2], :]
    print C
    
        dataA  start_index  end_index
    A0      0            0          2
    A2      2            3          6
    
    print get_slice(B, C)
    
           dataB
    A0 B0      0
       B1      1
    A2 B0      3
       B1      4
       B3      5
    
    0 讨论(0)
  • 2021-02-08 15:20

    The more readable (to my liking) solution is to reindex the boolean series (dataframe) to match index of the multi-index df:

    df.loc[filt.reindex(df.index, level='a')]
    
    0 讨论(0)
  • 2021-02-08 15:26

    If you transform your index 'a' back to a column, you can do it as follows:

    >>> df = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3], 
                           'b':[1,2,3,1,2,3,1,2,3], 
                           'c':range(9)})
    >>> filt = pd.Series({1:True, 2:False, 3:True})
    >>> df[filt[df['a']].values]
       a  b  c
    0  1  1  0
    1  1  2  1
    2  1  3  2
    6  3  1  6
    7  3  2  7
    8  3  3  8
    

    edit. As suggested by @joris, this works also with indices. Here is the code for your sample data:

    >>> df[filt[df.index.get_level_values('a')].values]
         c
    a b   
    1 1  0
      2  1
      3  2
    3 1  6
      2  7
      3  8
    
    0 讨论(0)
  • 2021-02-08 15:32

    Simply:

    df.where(
        filt.rename_axis('a').rename('c').to_frame()
    ).dropna().astype(int)
    

    Explanation:

    • .rename_axis('a') renames the index as a (the index we want to filter by)
    • .rename('c') renames the column as c (the column that stores the values)
    • .to_frame() converts this Series into a DataFrame, for compatibility with df
    • df.where(...) filters the rows, leaving missing values (NaN) where filter is False
    • .drop_na() removes the rows with missing values (in our case where a == 2)
    • .astype(int) converts from float back to int (not sure why float to begin with)

    By the way, it seems that df.where(...) and df[...] behave similarly here, so take your pick.

    0 讨论(0)
提交回复
热议问题