Question
Questions are at the end, in bold. But first, let's set up some data:
import numpy as np
import pandas as pd
from itertools import product
np.random.seed(1)
team_names = ['Yankees', 'Mets', 'Dodgers']
jersey_numbers = [35, 71, 84]
game_numbers = [1, 2]
observer_names = ['Bill', 'John', 'Ralph']
observation_types = ['Speed', 'Strength']
row_indices = list(product(team_names, jersey_numbers, game_numbers, observer_names, observation_types))
observation_values = np.random.randn(len(row_indices))
tns, jns, gns, ons, ots = zip(*row_indices)
data = pd.DataFrame({'team': tns, 'jersey': jns, 'game': gns, 'observer': ons, 'obstype': ots, 'value': observation_values})
data = data.set_index(['team', 'jersey', 'game', 'observer', 'obstype'])
data = data.unstack(['observer', 'obstype'])
data.columns = data.columns.droplevel(0)
This gives a DataFrame with a three-level row index (team, jersey, game) and a two-level column index (observer, obstype) holding the observation values.
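To sanity-check the shape of the result without reproducing the printed table, a quick inspection sketch (not part of the original post):
data.index.names    # FrozenList(['team', 'jersey', 'game'])
data.columns.names  # FrozenList(['observer', 'obstype'])
data.shape          # (18, 6): 3 teams x 3 jerseys x 2 games rows, 3 observers x 2 obstypes columns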
I want to pluck out a subset of this DataFrame for subsequent analysis. Say I wanted to slice out the rows where the jersey number is 71. I don't really like the idea of using xs to do this. When you do a cross section via xs you lose the column you selected on. If I run:
data.xs(71, axis=0, level='jersey')
then I get back the right rows, but I lose the jersey column.
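As an aside, recent pandas versions let xs keep the selected level via its drop_level argument, for example:
data.xs(71, axis=0, level='jersey', drop_level=False)  # keeps the jersey level in the result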
Also, xs doesn't seem like a great solution for the case where I want a few different values from the jersey column. I think a much nicer solution is the one found here:
data[[j in [71, 84] for t, j, g in data.index]]
You could even filter on a combination of jerseys and teams:
data[[j in [71, 84] and t in ['Dodgers', 'Mets'] for t, j, g in data.index]]
Nice!
So the question: how can I do something similar for selecting a subset of columns? For example, say I want only the columns representing data from Ralph. How can I do that without using xs? Or what if I wanted only the columns with observer in ['John', 'Ralph']? Again, I'd really prefer a solution that keeps all the levels of the row and column indices in the result, just like the boolean indexing examples above.
I can do what I want, and even combine selections from both the row and column indices. But the only solution I've found involves some real gymnastics:
data[[j in [71, 84] and t in ['Dodgers', 'Mets'] for t, j, g in data.index]]\
.T[[obs in ['John', 'Ralph'] for obs, obstype in data.columns]].T
And thus the second question: is there a more compact way to do what I just did above?
Answer 1:
As of Pandas 0.18 (possibly earlier) you can easily slice multi-indexed DataFrames using pd.IndexSlice.
For your specific question, you can use the following to slice the (team, jersey, game) row index, here filtering on jersey only:
data.loc[pd.IndexSlice[:,[71, 84],:],:] #IndexSlice on the rows
IndexSlice needs just enough level information to be unambiguous so you can drop the trailing colon:
data.loc[pd.IndexSlice[:,[71, 84]],:]
Likewise, you can IndexSlice on columns:
data.loc[pd.IndexSlice[:,[71, 84]],pd.IndexSlice[['John', 'Ralph']]]
Which gives you the final DataFrame in your question.
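Building on this, the combined row-and-column selection from the question (including the team filter) can be written the same way; a sketch using idx as shorthand, not part of the original answer:
idx = pd.IndexSlice
# rows: team in ['Dodgers', 'Mets'] and jersey in [71, 84]; columns: observer in ['John', 'Ralph']
# if pandas warns about lexsort depth, run data = data.sort_index(axis=0).sort_index(axis=1) first
data.loc[idx[['Dodgers', 'Mets'], [71, 84], :], idx[['John', 'Ralph'], :]]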
Answer 2:
Here is one approach that uses slightly more built-in-feeling syntax. But it's still clunky as hell:
data.loc[
(data.index.get_level_values('jersey').isin([71, 84])
& data.index.get_level_values('team').isin(['Dodgers', 'Mets'])),
data.columns.get_level_values('observer').isin(['John', 'Ralph'])
]
So comparing:
def hackedsyntax():
return data[[j in [71, 84] and t in ['Dodgers', 'Mets'] for t, j, g in data.index]]\
.T[[obs in ['John', 'Ralph'] for obs, obstype in data.columns]].T
def uglybuiltinsyntax():
return data.loc[
(data.index.get_level_values('jersey').isin([71, 84])
& data.index.get_level_values('team').isin(['Dodgers', 'Mets'])),
data.columns.get_level_values('observer').isin(['John', 'Ralph'])
]
%timeit hackedsyntax()
%timeit uglybuiltinsyntax()
hackedsyntax() - uglybuiltinsyntax()
results:
1000 loops, best of 3: 395 µs per loop
1000 loops, best of 3: 409 µs per loop
Still hopeful there's a cleaner or more canonical way to do this.
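One way to make this read a little better is to pull the repeated get_level_values/isin pattern into a small helper; the name level_isin below is made up for illustration (a sketch, not part of the original answer):
def level_isin(index, level, values):
    # boolean mask: True where the given level of the (Multi)Index is one of `values`
    return index.get_level_values(level).isin(values)

data.loc[
    level_isin(data.index, 'jersey', [71, 84]) & level_isin(data.index, 'team', ['Dodgers', 'Mets']),
    level_isin(data.columns, 'observer', ['John', 'Ralph'])
]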
Answer 3:
Note: Since Pandas v0.20, the ix accessor has been deprecated; use loc or iloc instead, as appropriate.
If I've understood the question correctly, it's pretty simple:
To get the column for Ralph:
data.ix[:,"Ralph"]
to get it for two of them, pass in a list:
data.ix[:,["Ralph","John"]]
The ix indexer is the general-purpose indexing operator. Remember that the first argument selects rows and the second selects columns (as opposed to data[..][..], which is the other way around). The colon acts as a wildcard, so it returns all the rows along axis=0.
In general, to do a lookup in a MultiIndex you should pass in a tuple, e.g.
data.ix[:,("Ralph","Speed")]
But if you just pass in a single element, it will treat this as if you're passing in the first element of the tuple and then a wildcard.
Where it gets tricky is if you want to access columns that are not level 0 indices, for example all the columns for "Speed". Then you'd need to get a bit more creative: use the get_level_values method of the index/columns in combination with boolean indexing. For example, this gets jersey 71 in the rows and Strength in the columns:
data.ix[data.index.get_level_values("jersey") == 71 , \
data.columns.get_level_values("obstype") == "Strength"]
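Since ix is deprecated, the same selections can be written with loc; a sketch of the equivalents, not part of the original answer:
data.loc[:, "Ralph"]             # all of Ralph's columns (both obstypes)
data.loc[:, ["Ralph", "John"]]   # columns for two observers
data.loc[:, ("Ralph", "Speed")]  # a single (observer, obstype) column, returned as a Series
data.loc[data.index.get_level_values("jersey") == 71,
         data.columns.get_level_values("obstype") == "Strength"]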
Answer 4:
Note that from what I understand, select is slow. But another approach here would be:
data.select(lambda col: col[0] in ['John', 'Ralph'], axis=1)
you can also chain this with a selection against the rows:
data.select(lambda col: col[0] in ['John', 'Ralph'], axis=1) \
.select(lambda row: row[1] in [71, 84] and row[2] > 1, axis=0)
The big drawback here is that you have to know the index level number.
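Note also that select was deprecated around pandas 0.21 and later removed, so on current pandas the chained selection above needs a different spelling; a rough loc-based equivalent (a sketch, not from the original answer):
row_mask = [j in [71, 84] and g > 1 for t, j, g in data.index]          # same test as the axis=0 select
col_mask = [obs in ['John', 'Ralph'] for obs, obstype in data.columns]  # same test as the axis=1 select
data.loc[row_mask, col_mask]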
Source: https://stackoverflow.com/questions/20754746/using-boolean-indexing-for-row-and-column-multiindex-in-pandas