pandas: Boolean indexing with multi index


再見小時候 2021-02-08 14:52

There are many questions here with similar titles, but I couldn't find one that's addressing this issue.

I have dataframes from many different origins, and I want to f

6 answers

無奈伤痛 (original poster)

2021-02-08 15:19

I was facing exactly the same problem. I found this question and tried the solutions here, but none of them was efficient enough. My dataframes are: A = 700k rows x 14 cols, B = 100M rows x 3 cols. B has a MultiIndex whose first (outermost) level matches the index of A. Let C be a slice of A with 10k rows. My task was to fetch, as fast as possible, the rows of B whose first-level index value appears in the index of C. C is selected at runtime; A and B are static.

I tried the solutions from here: get_level_values takes many seconds, and df.align did not even finish: it ran for seconds and then raised a MemoryError.
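For comparison, the get_level_values approach is presumably a boolean mask like the sketch below. The exact code from the other answers is not reproduced here, so the use of isin and the toy frames are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for the real frames (assumed shapes, for illustration only).
A = pd.DataFrame({'dataA': range(3)}, index=['A0', 'A1', 'A2'])
mindex = pd.MultiIndex.from_tuples([
    ('A0', 'B0'), ('A0', 'B1'), ('A1', 'B0'),
    ('A2', 'B0'), ('A2', 'B1'), ('A2', 'B3')])
B = pd.DataFrame({'dataB': range(6)}, index=mindex)
C = A.iloc[[0, 2]]

# Boolean mask over every row of B: True where the first-level key is in C.
mask = B.index.get_level_values(0).isin(C.index)
baseline = B[mask]
print(baseline)
```

The mask has one entry per row of B and is rebuilt on every call, which is likely where the time goes when B has 100M rows.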

The solution that worked for me (about 300 ms at runtime) is the following:

For each index value i in A, find the first and last (exclusive) positions in B whose first MultiIndex level equals i, and store these pairs in A. This is done once, in advance. Example code:

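from collections import defaultdict

import pandas as pd


def construct_position_indexes(A, B):
    # Assumes rows of B that share a first-level key are contiguous
    # (e.g. B is sorted by its first index level).
    indexes = defaultdict(list)
    prev_index = None  # None rather than 0, so falsy keys such as 0 work
    for i, cur_index in enumerate(B.index.get_level_values(0)):
        if cur_index != prev_index:
            # a new run starts at position i ...
            indexes[cur_index].append(i)
            # ... which also closes the previous run (end is exclusive)
            if prev_index is not None:
                indexes[prev_index].append(i)
        prev_index = cur_index
    # close the last run; skipped when B is empty
    if prev_index is not None:
        indexes[prev_index].append(i + 1)
    index_df = pd.DataFrame(list(indexes.values()),
                            index=list(indexes.keys()),
                            columns=['start_index', 'end_index'], dtype=int)
    A = A.join(index_df)
    # keys of A missing from B become NaN (and the columns float) after the
    # join; fill with 0 so they map to the empty range [0, 0), then cast back
    A['start_index'] = A.start_index.fillna(0).astype(int)
    A['end_index'] = A.end_index.fillna(0).astype(int)
    return A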
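
The same boundaries can probably be computed without a Python-level loop. The following is a sketch using NumPy, not the code from this answer; run_boundaries is a hypothetical helper name, and it assumes, like the loop above, that equal first-level keys are contiguous in B.

```python
import numpy as np
import pandas as pd


def run_boundaries(keys):
    """Return (run_keys, starts, ends) for each contiguous run in keys.

    Hypothetical helper for illustration; ends are exclusive.
    """
    keys = np.asarray(keys)
    if len(keys) == 0:
        empty = np.array([], dtype=np.int64)
        return keys, empty, empty
    # Position i starts a new run when keys[i] differs from keys[i - 1].
    change = np.flatnonzero(keys[1:] != keys[:-1]) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(keys)]))
    return keys[starts], starts, ends


mindex = pd.MultiIndex.from_tuples([
    ('A0', 'B0'), ('A0', 'B1'), ('A1', 'B0'),
    ('A2', 'B0'), ('A2', 'B1'), ('A2', 'B3')])
B = pd.DataFrame({'dataB': range(6)}, index=mindex)

run_keys, starts, ends = run_boundaries(B.index.get_level_values(0))
index_df = pd.DataFrame({'start_index': starts, 'end_index': ends},
                        index=run_keys)
print(index_df)
```

The resulting index_df has the same two columns as the one built in the loop, so it could be joined onto A in the same way.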

At runtime, read the positional boundaries from C, build the list of all positions to fetch from B, and pass it to B.take():

def get_slice(B, C):
    all_indexes = []
    for start_index, end_index in zip(
            C.start_index.values, C.end_index.values):
        all_indexes.extend(range(start_index, end_index))
    return B.take(all_indexes)

I hope it's not too complicated. Essentially, the idea is to store, for each row in A, the range of positions of the corresponding rows in B, so that at runtime we can quickly build the list of all positions to query B by.
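
If building the position list in a Python loop becomes the bottleneck, the ranges can be expanded with NumPy instead. This is a sketch under the assumption that every end is greater than or equal to its start; expand_ranges is a hypothetical helper, not part of pandas or of the answer's code.

```python
import numpy as np
import pandas as pd


def expand_ranges(starts, ends):
    """Concatenate range(s, e) for each (s, e) pair into one integer array."""
    starts = np.asarray(starts, dtype=np.int64)
    ends = np.asarray(ends, dtype=np.int64)
    lengths = ends - starts
    # Offset in the output at which each range begins.
    out_offsets = np.cumsum(lengths) - lengths
    # Position of each output element inside its own range: 0, 1, 2, ...
    within = np.arange(lengths.sum()) - np.repeat(out_offsets, lengths)
    return np.repeat(starts, lengths) + within


mindex = pd.MultiIndex.from_tuples([
    ('A0', 'B0'), ('A0', 'B1'), ('A1', 'B0'),
    ('A2', 'B0'), ('A2', 'B1'), ('A2', 'B3')])
B = pd.DataFrame({'dataB': range(6)}, index=mindex)

# Boundaries for keys A0 and A2, as stored in C by the precomputation step.
positions = expand_ranges([0, 3], [2, 6])
result = B.take(positions)
print(result)
```

Empty ranges (start equal to end) contribute nothing to the output, which matches how keys missing from B are handled by the fillna(0) step.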

This is a toy example:

A = pd.DataFrame(range(3), columns=['dataA'], index=['A0', 'A1', 'A2'])
print(A)

    dataA
A0      0
A1      1
A2      2

mindex = pd.MultiIndex.from_tuples([
    ('A0', 'B0'), ('A0', 'B1'), ('A1', 'B0'), 
    ('A2', 'B0'), ('A2', 'B1'), ('A2', 'B3')])
B = pd.DataFrame(range(6), columns=['dataB'], index=mindex)
print(B)

       dataB
A0 B0      0
   B1      1
A1 B0      2
A2 B0      3
   B1      4
   B3      5

A = construct_position_indexes(A, B)
print(A)

    dataA  start_index  end_index
A0      0            0          2
A1      1            2          3
A2      2            3          6

C = A.iloc[[0, 2], :]
print(C)

    dataA  start_index  end_index
A0      0            0          2
A2      2            3          6

print(get_slice(B, C))

       dataB
A0 B0      0
   B1      1
A2 B0      3
   B1      4
   B3      5
