pandas: Boolean indexing with multi index

再見小時候 2021-02-08 14:52

There are many questions here with similar titles, but I couldn't find one that addresses this issue.

I have dataframes from many different origins, and I want to f

6 answers
  •  無奈伤痛
    2021-02-08 15:19

    I was facing exactly the same problem. I found this question and tried the solutions here, but none of them was efficient enough. My dataframes are: A = 700k rows x 14 cols, B = 100M rows x 3 cols. B has a MultiIndex whose first (highest) level takes its values from the index of A. Let C be a slice of A with 10k rows. My task was to get, as fast as possible, the rows of B whose first-level index value appears in the index of C. C is selected at runtime; A and B are static.

    I tried the solutions from here: get_level_values takes many seconds, and df.align did not even finish; it raised a MemoryError (after also running for several seconds).
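
For reference, a boolean-mask lookup based on get_level_values presumably looks something like the sketch below. The tiny frames are made up for illustration, and combining get_level_values with isin is an assumption about what the other solutions did, not a quote from them. The mask has one entry per row of B, so its cost grows with the size of B rather than with the size of C.

```python
import pandas as pd

# Made-up stand-ins for the real A (700k rows) and B (100M rows)
A = pd.DataFrame({'dataA': range(3)}, index=['A0', 'A1', 'A2'])
mindex = pd.MultiIndex.from_tuples([
    ('A0', 'B0'), ('A0', 'B1'), ('A1', 'B0'),
    ('A2', 'B0'), ('A2', 'B1'), ('A2', 'B3')])
B = pd.DataFrame({'dataB': range(6)}, index=mindex)
C = A.iloc[[0, 2]]  # the runtime selection

# One boolean per row of B: True where the first index level is in C.index
mask = B.index.get_level_values(0).isin(C.index)
baseline = B[mask]
```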

    The solution that worked for me (about 300 ms at runtime) is the following:

    1. For each index value i in A, find the first and the last (exclusive) positions in B whose first MultiIndex level equals i. Store these pairs in A. This is done once, in advance, and relies on B being sorted so that rows with the same first-level value are contiguous. Example code:

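      from collections import defaultdict

      import pandas as pd

      def construct_position_indexes(A, B):
          # Assumes rows of B that share a first-level value are contiguous
          indexes = defaultdict(list)
          prev_index = None
          for i, cur_index in enumerate(B.index.get_level_values(0)):
              if i == 0 or cur_index != prev_index:
                  indexes[cur_index].append(i)        # start of a new run
                  if i > 0:
                      indexes[prev_index].append(i)   # exclusive end of the previous run
              prev_index = cur_index
          if len(B):
              indexes[prev_index].append(len(B))      # close the final run
          index_df = pd.DataFrame(list(indexes.values()),
                                  index=list(indexes.keys()),
                                  columns=['start_index', 'end_index'], dtype=int)
          A = A.join(index_df)
          # Rows of A without a match in B get NaN, which turns the columns
          # into floats; fill with 0 (an empty range) and cast back to int
          A['start_index'] = A.start_index.fillna(0).astype(int)
          A['end_index'] = A.end_index.fillna(0).astype(int)
          return A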
      
    2. At runtime, get positional boundaries from C and construct a list of all positional indexes to search for in B, and pass them to B.take():

      def get_slice(B, C):
          all_indexes = []
          for start_index, end_index in zip(
                  C.start_index.values, C.end_index.values):
              all_indexes.extend(range(start_index, end_index))
          return B.take(all_indexes)
      

    I hope it's not too complicated. Essentially, the idea is to store, for each row in A, the range of positional indexes of the corresponding rows in B, so that at runtime we can quickly build the list of all positional indexes to query B by.
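
The same idea can also be sketched with NumPy doing the scanning instead of a Python loop. This is a variant under stated assumptions, not the code from the answer: the helper names build_boundaries and take_ranges are hypothetical, and the rows of B are assumed to be contiguous per first-level value.

```python
import numpy as np
import pandas as pd


def build_boundaries(B):
    """Return [start_index, end_index) per first-level value of B.

    Assumes rows sharing a first-level value are contiguous in B.
    """
    level0 = np.asarray(B.index.get_level_values(0))
    if len(level0) == 0:
        return pd.DataFrame({'start_index': [], 'end_index': []}, dtype=int)
    # Positions where the first-level value differs from the previous row
    change = np.flatnonzero(level0[1:] != level0[:-1]) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(level0)]))
    return pd.DataFrame({'start_index': starts, 'end_index': ends},
                        index=level0[starts])


def take_ranges(B, starts, ends):
    """Concatenate the positional ranges and fetch them with one take()."""
    ranges = [np.arange(s, e) for s, e in zip(starts, ends)]
    positions = np.concatenate(ranges) if ranges else np.array([], dtype=int)
    return B.take(positions)


# Made-up data in the same shape as the toy example
mindex = pd.MultiIndex.from_tuples([
    ('A0', 'B0'), ('A0', 'B1'), ('A1', 'B0'),
    ('A2', 'B0'), ('A2', 'B1'), ('A2', 'B3')])
B = pd.DataFrame({'dataB': range(6)}, index=mindex)

bounds = build_boundaries(B)        # computed once, in advance
wanted = bounds.loc[['A0', 'A2']]   # the runtime selection
result = take_ranges(B,
                     wanted['start_index'].to_numpy(),
                     wanted['end_index'].to_numpy())
```

Whether this is actually faster than the loop-based version on data of the size described has not been measured here; it mainly shows that the boundaries are just the positions where the first-level value changes.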

    This is a toy example:

    A = pd.DataFrame(range(3), columns=['dataA'], index=['A0', 'A1', 'A2'])
    print(A)
    
        dataA
    A0      0
    A1      1
    A2      2
    
    mindex = pd.MultiIndex.from_tuples([
        ('A0', 'B0'), ('A0', 'B1'), ('A1', 'B0'), 
        ('A2', 'B0'), ('A2', 'B1'), ('A2', 'B3')])
    B = pd.DataFrame(range(6), columns=['dataB'], index=mindex)
    print(B)
    
           dataB
    A0 B0      0
       B1      1
    A1 B0      2
    A2 B0      3
       B1      4
       B3      5
    
    A = construct_position_indexes(A, B)
    print(A)
    
        dataA  start_index  end_index
    A0      0            0          2
    A1      1            2          3
    A2      2            3          6
    
    C = A.iloc[[0, 2], :]
    print(C)
    
        dataA  start_index  end_index
    A0      0            0          2
    A2      2            3          6
    
    print(get_slice(B, C))
    
           dataB
    A0 B0      0
       B1      1
    A2 B0      3
       B1      4
       B3      5
    
