Question
I have two multiindexed dataframes, one with two levels and one with three. The first two levels match in both dataframes. I would like to find all values from the first dataframe where the first two index levels match in the second dataframe. The second data frame does not have a third level.
The closest answer I have found is this: How to slice one MultiIndex DataFrame with the MultiIndex of another -- however the setup is slightly different and doesn't seem to translate to this case.
Consider the setup below:
import numpy as np
import pandas as pd
array_1 = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
           np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']),
           np.array(['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'])]
array_2 = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
           np.array(['one', 'two', 'three', 'one', 'two', 'two', 'one', 'two'])]
df_1 = pd.DataFrame(np.random.randn(8,4), index=array_1).sort_index()
print(df_1)
                  0         1         2         3
bar one a  1.092651 -0.325324  1.200960 -0.790002
    two a -0.415263  1.006325 -0.077898  0.642134
baz one a -0.343707  0.474817  0.396702 -0.379066
    two a  0.315192 -1.548431 -0.214253 -1.790330
foo one b  1.022050 -2.791862  0.172165  0.924701
    two b  0.622062 -0.193056 -0.145019  0.763185
qux one b -1.241954 -1.270390  0.147623 -0.301092
    two b  0.778022  1.450522  0.683487 -0.950528
df_2 = pd.DataFrame(np.random.randn(8,4), index=array_2).sort_index()
print(df_2)
                  0         1         2         3
bar one   -0.354889 -1.283470 -0.977933 -0.601868
    two   -0.849186 -2.455453  0.790439  1.134282
baz one   -0.143299  2.372440 -0.161744  0.919658
    three -1.008426 -0.116167 -0.268608  0.840669
foo two   -0.644028  0.447836 -0.576127 -0.891606
    two   -0.163497 -1.255801 -1.066442  0.624713
qux one   -1.545989 -0.422028 -0.489222 -0.357954
    two   -1.202655  0.736047 -1.084002  0.732150
Now I query the second dataframe, returning a subset of the original indexes:
df_2_selection = df_2[(df_2 > 1).any(axis=1)]
print(df_2_selection)
                0         1         2         3
bar two -0.849186 -2.455453  0.790439  1.134282
baz one -0.143299  2.372440 -0.161744  0.919658
I would like to find all the values in df_1 that match the indices found in df_2_selection. The first two levels line up, but df_2_selection has no third level.
This problem is easy when the indices line up, and would be solved by something like:
df_1.loc[df_2_selection.index]  # this works if the indexes are the same
I can also find the values that match one of the levels with something like:
df_1[df_1.index.isin(df_2_selection.index.get_level_values(0),level = 0)]
but this does not solve the problem.
Chaining these statements together does not provide the desired functionality
df_1[(df_1.index.isin(df_2_selection.index.get_level_values(0),level = 0)) & (df_1.index.isin(df_2_selection.index.get_level_values(1),level = 1))]
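To see why chaining the per-level checks over-selects, here is a minimal sketch. It assumes deterministic stand-in data in place of the random values, and a hypothetical `wanted` index standing in for df_2_selection.index. Each level is tested independently, so any combination of a wanted first-level label with a wanted second-level label passes.

```python
import numpy as np
import pandas as pd

# Stand-in for df_1 (assumption: fixed values instead of np.random.randn).
idx_1 = pd.MultiIndex.from_arrays([
    ['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'],
    ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
])
df_1 = pd.DataFrame(np.arange(32).reshape(8, 4), index=idx_1)

# Hypothetical stand-in for df_2_selection.index: the two wanted pairs.
wanted = pd.MultiIndex.from_tuples([('bar', 'two'), ('baz', 'one')])

# Level 0 and level 1 are checked separately, so the pairing is lost:
# ('bar', 'one') and ('baz', 'two') also pass both checks.
mask = (df_1.index.isin(wanted.get_level_values(0), level=0)
        & df_1.index.isin(wanted.get_level_values(1), level=1))
over_matched = df_1[mask]
print(over_matched.index.tolist())
```

In this sketch the mask keeps four rows although only two pairs were requested.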
I envision something along the lines of:
df_1_select = df_1[(df_1.index.isin(
    df_2_selection.index.get_level_values([0,1]), level=[0,1]))]  # Doesn't work
print(df_1_select)
                  0         1         2         3
bar two a -0.415263  1.006325 -0.077898  0.642134
baz one a -0.343707  0.474817  0.396702 -0.379066
I have tried many other methods, none of which worked exactly how I wanted. Thank you for your consideration.
EDIT:
The following also does not work:
df_1.loc[pd_idx[df_2_selection.index.get_level_values(0),df_2_selection.index.get_level_values(1),:],:]
I want only the rows where both levels match, not the rows where either level matches.
EDIT 2: This solution was posted by someone who has since deleted it
id=[x+([x for x in df_1.index.levels[-1]]) for x in df_2_selection.index.values]
pd.concat([df_1.loc[x] for x in id])
This does indeed work. However, on large dataframes it is prohibitively slow. Any help with new methods or a speedup is greatly appreciated.
Answer 1:
You can use reset_index() and merge().
With df_2_selection as:
                0         1         2         3
foo two -0.530151  0.932007 -1.255259  2.441294
qux one  2.006270  1.087412 -0.840916 -1.225508
Merge with:
lvls = ["level_0", "level_1"]
(df_1.reset_index()
     .merge(df_2_selection.reset_index()[lvls], on=lvls)
     .set_index(["level_0", "level_1", "level_2"])
     .rename_axis([None]*3)
)
Output:
                  0         1         2         3
foo two b -0.112696  0.287421 -0.380692 -0.035471
qux one b  0.658227  0.632667 -0.193224  1.073132
Note: The rename_axis() part just removes the level names, e.g. level_0. It's purely cosmetic, and not necessary to perform the actual matching procedure.
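A self-contained sketch of the merge approach, under a few assumptions: fixed stand-in data replaces the random values, and the selection deliberately contains the pair ('foo', 'two') twice, as df_2 in the question does. The `keys` name and the drop_duplicates() step are additions for illustration. Without that step, the merge would return the matching df_1 row once per repeated pair. The column names level_0, level_1 and level_2 are what reset_index() produces for unnamed index levels.

```python
import numpy as np
import pandas as pd

# Stand-in for df_1 (assumption: fixed values instead of np.random.randn).
idx_1 = pd.MultiIndex.from_arrays([
    ['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'],
    ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
])
df_1 = pd.DataFrame(np.arange(32).reshape(8, 4), index=idx_1)

# Stand-in for df_2_selection, with ('foo', 'two') appearing twice.
idx_sel = pd.MultiIndex.from_tuples(
    [('foo', 'two'), ('foo', 'two'), ('qux', 'one')])
df_2_selection = pd.DataFrame(np.ones((3, 4)), index=idx_sel)

lvls = ["level_0", "level_1"]
# Keep only the key columns and drop repeated pairs before merging.
keys = df_2_selection.reset_index()[lvls].drop_duplicates()
result = (df_1.reset_index()
              .merge(keys, on=lvls)           # inner join on both levels
              .set_index(lvls + ["level_2"])  # restore the three-level index
              .rename_axis([None] * 3))       # cosmetic: drop level names
print(result)
```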
Answer 2:
Try this:
pd.concat([
    df_1.xs(key, drop_level=False)
    for key in df_2_selection.index.values])
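As a possible vectorized alternative (not one of the approaches posted here), the third level can be dropped from the index of df_1 so that whole (level 0, level 1) pairs are tested for membership in one call. The sketch below assumes fixed stand-in data and a hypothetical `wanted` index in place of df_2_selection.index. It relies on the existing pandas methods MultiIndex.droplevel and MultiIndex.isin.

```python
import numpy as np
import pandas as pd

# Stand-in for df_1 (assumption: fixed values instead of np.random.randn).
idx_1 = pd.MultiIndex.from_arrays([
    ['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'],
    ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
])
df_1 = pd.DataFrame(np.arange(32).reshape(8, 4), index=idx_1)

# Hypothetical stand-in for df_2_selection.index.
wanted = pd.MultiIndex.from_tuples([('bar', 'two'), ('baz', 'one')])

# Drop the third level so each row is represented by its (level 0, level 1)
# pair, then test those pairs as a unit against the wanted pairs.
mask = df_1.index.droplevel(2).isin(wanted)
df_1_select = df_1[mask]
print(df_1_select)
```

Because the result is a boolean mask over df_1, the rows keep their original order, and repeated pairs in the selection do not duplicate rows.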
Source: https://stackoverflow.com/questions/47047140/pandas-slice-one-multiindex-dataframe-with-multiindex-of-another-when-some-leve