There are many questions here with similar titles, but I couldn't find one that's addressing this issue.
I have dataframes from many different origins, and I want to f
I was facing exactly the same problem. I found this question and tried the solutions here, but none of them was efficient enough. My dataframes are A (700k rows x 14 cols) and B (100M rows x 3 cols). B has a MultiIndex whose first (high) level is equal to the index of A. Let C be a slice of A of 10k rows. My task was to get the rows of B whose high-level index matches the index of C, as fast as possible. C is selected at runtime; A and B are static.
I tried the solutions from here: get_level_values takes many seconds, and df.align didn't even finish; it raised a MemoryError (after also running for seconds).
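For reference, here is a minimal sketch of what a get_level_values-based lookup typically looks like. The exact code that was timed is not shown in this answer, so the function name slice_by_mask and the use of isin are assumptions; the toy frames only mirror the shapes described above. The point is that the boolean mask has to be computed over every row of B on each call, which is what makes it slow at 100M rows.

```python
import pandas as pd

# Toy stand-ins for the real frames (hypothetical data, same structure).
A = pd.DataFrame({'dataA': [0, 1, 2]}, index=['A0', 'A1', 'A2'])
mindex = pd.MultiIndex.from_tuples([
    ('A0', 'B0'), ('A0', 'B1'), ('A1', 'B0'),
    ('A2', 'B0'), ('A2', 'B1'), ('A2', 'B3')])
B = pd.DataFrame({'dataB': list(range(6))}, index=mindex)
C = A.iloc[[0, 2]]


def slice_by_mask(B, C):
    # Builds a boolean mask with one entry per row of B, then filters.
    # The cost grows with len(B), no matter how small C is.
    mask = B.index.get_level_values(0).isin(C.index)
    return B[mask]


baseline = slice_by_mask(B, C)
print(baseline)
```

The result is correct, but the work done per call does not shrink when C is small, which motivates precomputing positions instead.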
The solution that worked for me (in ~300 ms at runtime) is the following:
For each index value i from A, find the first and the last (non-inclusive) positional indexes in B which contain i as the first level of the MultiIndex. Store these pairs in A. This is done once and in advance.
Example code:
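from collections import defaultdict

import pandas as pd

def construct_position_indexes(A, B):
    # Maps each first-level value of B's index to its [start, end) positions.
    # Relies on rows of B with the same first-level value being contiguous
    # (e.g. B sorted by its index).
    indexes = defaultdict(list)
    prev_index = None  # sentinel, so a falsy index value such as 0 still works
    for i, cur_index in enumerate(B.index.get_level_values(0)):
        if cur_index != prev_index:
            indexes[cur_index].append(i)       # start of a new run
            if prev_index is not None:
                indexes[prev_index].append(i)  # end (exclusive) of the previous run
        prev_index = cur_index
    indexes[cur_index].append(i + 1)  # close the last run
    index_df = pd.DataFrame(list(indexes.values()),
                            index=list(indexes.keys()),
                            columns=['start_index', 'end_index'], dtype=int)
    A = A.join(index_df)
    # rows of A without a match in B get NaN, which turns the columns into
    # floats, so we fix that (such rows end up with the empty range 0..0)
    A['start_index'] = A.start_index.fillna(0).astype(int)
    A['end_index'] = A.end_index.fillna(0).astype(int)
    return A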
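Since the precomputation walks over every row of B in a Python loop, it can take a while on a large frame. Below is a hedged sketch of a vectorized variant that finds the run boundaries with NumPy instead. The name construct_position_indexes_vectorized is hypothetical, and the sketch assumes B is non-empty and that equal first-level values are contiguous in B.

```python
import numpy as np
import pandas as pd


def construct_position_indexes_vectorized(A, B):
    # Assumption: B is non-empty and equal first-level values are contiguous.
    level0 = B.index.get_level_values(0).to_numpy()
    # Positions where the first-level value differs from the previous row.
    change = np.flatnonzero(level0[1:] != level0[:-1]) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(level0)]))  # exclusive ends
    index_df = pd.DataFrame({'start_index': starts, 'end_index': ends},
                            index=level0[starts])
    out = A.join(index_df)
    # Rows of A with no match in B get NaN; map them to the empty range 0..0.
    out['start_index'] = out['start_index'].fillna(0).astype(int)
    out['end_index'] = out['end_index'].fillna(0).astype(int)
    return out


# Toy data (hypothetical, mirrors the structure described above).
A = pd.DataFrame({'dataA': [0, 1, 2]}, index=['A0', 'A1', 'A2'])
mindex = pd.MultiIndex.from_tuples([
    ('A0', 'B0'), ('A0', 'B1'), ('A1', 'B0'),
    ('A2', 'B0'), ('A2', 'B1'), ('A2', 'B3')])
B = pd.DataFrame({'dataB': list(range(6))}, index=mindex)

A_pos = construct_position_indexes_vectorized(A, B)
print(A_pos)
```

Because this step runs only once, the loop version is also fine; the vectorized form mainly matters if the static data has to be rebuilt often.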
At runtime, get the positional boundaries from C, construct a list of all positional indexes to look up in B, and pass them to B.take():
def get_slice(B, C):
    # Expand every [start_index, end_index) pair into explicit positions.
    all_indexes = []
    for start_index, end_index in zip(
            C.start_index.values, C.end_index.values):
        all_indexes.extend(range(start_index, end_index))
    return B.take(all_indexes)
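If building the position list in a Python loop becomes the bottleneck, the same list can be assembled with NumPy alone. This is only a sketch: get_slice_vectorized is a hypothetical name, and C is assumed to already carry the start_index and end_index columns produced by the precomputation step.

```python
import numpy as np
import pandas as pd


def get_slice_vectorized(B, C):
    starts = C['start_index'].to_numpy()
    ends = C['end_index'].to_numpy()
    lengths = ends - starts
    total = int(lengths.sum())
    if total == 0:
        return B.iloc[0:0]  # nothing to fetch
    # Offset of each run's first element within the output array.
    first_out = np.cumsum(lengths) - lengths
    # For output slot k in run r: position = starts[r] + (k - first_out[r]).
    positions = np.repeat(starts - first_out, lengths) + np.arange(total)
    return B.take(positions)


# Toy data (hypothetical): C holds two precomputed ranges, [0, 2) and [3, 6).
mindex = pd.MultiIndex.from_tuples([
    ('A0', 'B0'), ('A0', 'B1'), ('A1', 'B0'),
    ('A2', 'B0'), ('A2', 'B1'), ('A2', 'B3')])
B = pd.DataFrame({'dataB': list(range(6))}, index=mindex)
C = pd.DataFrame({'dataA': [0, 2], 'start_index': [0, 3], 'end_index': [2, 6]},
                 index=['A0', 'A2'])

result = get_slice_vectorized(B, C)
print(result)
```

Whether this is noticeably faster than the loop depends on how many ranges C contains; it is worth timing on real data before switching.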
I hope it's not too complicated. Essentially, the idea is to store, for each row in A, the range of corresponding (positional) indexes of rows in B, so that at runtime we can quickly construct the list of all positional indexes to query B by.
This is a toy example:
A = pd.DataFrame(range(3), columns=['dataA'], index=['A0', 'A1', 'A2'])
print(A)

    dataA
A0      0
A1      1
A2      2

mindex = pd.MultiIndex.from_tuples([
    ('A0', 'B0'), ('A0', 'B1'), ('A1', 'B0'),
    ('A2', 'B0'), ('A2', 'B1'), ('A2', 'B3')])
B = pd.DataFrame(range(6), columns=['dataB'], index=mindex)
print(B)

       dataB
A0 B0      0
   B1      1
A1 B0      2
A2 B0      3
   B1      4
   B3      5

A = construct_position_indexes(A, B)
print(A)

    dataA  start_index  end_index
A0      0            0          2
A1      1            2          3
A2      2            3          6

C = A.iloc[[0, 2], :]
print(C)

    dataA  start_index  end_index
A0      0            0          2
A2      2            3          6

print(get_slice(B, C))

       dataB
A0 B0      0
   B1      1
A2 B0      3
   B1      4
   B3      5