I have a DataFrame "df" with a (time, ticker) MultiIndex and bid/ask/etc. data columns:

                    tod    last     bid     ask  volume
time        ticker
2013-02-01  SPY    1600  149.70  150.14  150.17    1300
            SLV    1600   30.44   30.38   30.43    3892
            GLD    1600  161.20  161.19  161.21    3860
I would like to select a second-level (level=1) cross section using multiple keys. Right now, I can do it using one key, i.e.
df.xs('SPY', level=1)
which gives me a timeseries of SPY. What is the best way to select a multi-key cross section, i.e. a combined cross-section of both SPY and GLD, something like:
df.xs(['SPY', 'GLD'], level=1)
?
Convert to a panel, then indexing is direct:

In [20]: df = pd.DataFrame(dict(time = pd.Timestamp('20130102'), A = np.random.rand(3), ticker=['SPY','SLV','GLD'])).set_index(['time','ticker'])

In [21]: df
Out[21]:
                          A
time       ticker
2013-01-02 SPY     0.347209
           SLV     0.034832
           GLD     0.280951

In [22]: p = df.to_panel()

In [23]: p
Out[23]:
Dimensions: 1 (items) x 1 (major_axis) x 3 (minor_axis)
Items axis: A to A
Major_axis axis: 2013-01-02 00:00:00 to 2013-01-02 00:00:00
Minor_axis axis: GLD to SPY

In [24]: p.ix[:,:,['SPY','GLD']]
Out[24]:
Dimensions: 1 (items) x 1 (major_axis) x 2 (minor_axis)
Items axis: A to A
Major_axis axis: 2013-01-02 00:00:00 to 2013-01-02 00:00:00
Minor_axis axis: SPY to GLD
I couldn't find a more direct way other than using select:

>>> df
       last   tod
A SPY     1  1600
  SLV     2  1600
  GLD     3  1600
>>> df.select(lambda x: x[1] in ['SPY','GLD'])
       last   tod
A SPY     1  1600
  GLD     3  1600
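The same filter can also be written as a boolean mask on the index level instead of a per-row lambda, which may be useful where select is not available. A minimal sketch, assuming a toy frame shaped like the one above (the values and the names mask and result are illustrative only):

```python
import pandas as pd

# Toy frame shaped like the example above (values are illustrative).
idx = pd.MultiIndex.from_product([['A'], ['SPY', 'SLV', 'GLD']])
df = pd.DataFrame({'last': [1, 2, 3], 'tod': [1600, 1600, 1600]}, index=idx)

# Boolean mask over the second index level; no per-row lambda is needed.
mask = df.index.get_level_values(1).isin(['SPY', 'GLD'])
result = df[mask]
print(result)
```

Because the mask only filters rows, the original row order and both index levels are kept.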
There are better ways of doing this with more recent versions of pandas:
df.loc[(slice(None), ['SPY', 'GLD']), :]
This approach requires that the index be lexicographically sorted (use df.sort_index()).
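To make the sorting requirement concrete, here is a small self-contained sketch. The second date and the column values are made up for illustration; only the (time, ticker) index layout comes from the question.

```python
import numpy as np
import pandas as pd

# Illustrative (time, ticker) frame; dates and values are assumptions.
index = pd.MultiIndex.from_product(
    [pd.to_datetime(['2013-02-01', '2013-02-04']), ['SPY', 'SLV', 'GLD']],
    names=['time', 'ticker'],
)
df = pd.DataFrame({'last': np.arange(6, dtype=float)}, index=index)

# Sort the index first, as the selection below expects a sorted MultiIndex.
df = df.sort_index()

# slice(None) selects everything on the time level; the list picks tickers.
subset = df.loc[(slice(None), ['SPY', 'GLD']), :]
print(subset)
```

The result keeps both index levels, so it can be used like the original frame.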
For what it is worth, I did the following:
foo = pd.DataFrame(np.random.rand(12,3),
                   index=pd.MultiIndex.from_product([['A','B','C','D'],['Green','Red','Blue']],
                                                    names=['Letter','Color']),
                   columns=['X','Y','Z']).sort_index()

foo.reset_index()\
   .loc[foo.reset_index().Color.isin({'Green','Red'})]\
   .set_index(foo.index.names)
This approach is similar to select, but avoids iterating over all rows with a lambda.
However, I compared this to the Panel approach, and the Panel solution appears to be faster (2.91 ms for index/loc vs. 1.48 ms for to_panel/to_frame):
foo.to_panel()[:,:,['Green','Red']].to_frame()
Times:
In [56]:
%%timeit
foo.reset_index().loc[foo.reset_index().Color.isin({'Green','Red'})].set_index(foo.index.names)
100 loops, best of 3: 2.91 ms per loop

In [57]:
%%timeit
foo2 = foo.reset_index()
foo2.loc[foo2.Color.eq('Green') | foo2.Color.eq('Red')].set_index(foo.index.names)
100 loops, best of 3: 2.85 ms per loop

In [58]:
%%timeit
foo2 = foo.reset_index()
foo2.loc[foo2.Color.ne('Blue')].set_index(foo.index.names)
100 loops, best of 3: 2.37 ms per loop

In [54]:
%%timeit
foo.to_panel()[:,:,['Green','Red']].to_frame()
1000 loops, best of 3: 1.18 ms per loop
UPDATE
After revisiting this topic (again), I observed the following:
In [100]:
%%timeit
foo2 = pd.DataFrame({k: foo.loc[k] for k in foo.index if k[1] in ['Green','Red']}).transpose()
foo2.index.names = foo.index.names
foo2.columns.names = foo.columns.names
100 loops, best of 3: 1.97 ms per loop

In [101]:
%%timeit
foo2 = pd.DataFrame.from_dict({k: foo.loc[k] for k in foo.index if k[1] in ['Green','Red']}, orient='index')
foo2.index.names = foo.index.names
foo2.columns.names = foo.columns.names
100 loops, best of 3: 1.82 ms per loop
If you don't care about preserving the original order and naming of the levels, you can use:
%%timeit
pd.concat({key: foo.xs(key, axis=0, level=1) for key in ['Green','Red']}, axis=0)
1000 loops, best of 3: 1.31 ms per loop
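The concat/xs route changes the layout because the dict keys become a new, unnamed outer level while xs drops the Color level. A minimal sketch of one way to restore the original layout afterwards, assuming the same foo as above (the names picked and restored are illustrative, and the extra steps add some cost that is not included in the timing):

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame(
    np.random.rand(12, 3),
    index=pd.MultiIndex.from_product(
        [['A', 'B', 'C', 'D'], ['Green', 'Red', 'Blue']],
        names=['Letter', 'Color'],
    ),
    columns=['X', 'Y', 'Z'],
).sort_index()

# The dict keys become a new outer index level, so the colors come first
# and that level has no name.
picked = pd.concat(
    {key: foo.xs(key, axis=0, level=1) for key in ['Green', 'Red']},
    axis=0,
)

# Swap back to (Letter, Color), re-sort, and reapply the original level names.
restored = picked.swaplevel(0, 1).sort_index()
restored.index.names = foo.index.names
print(restored)
```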
And if you are just selecting on the first level:
%%timeit
pd.concat({key: foo.loc[key] for key in ['A','B']}, axis=0, names=foo.index.names)
1000 loops, best of 3: 1.12 ms per loop
versus:
%%timeit
foo.to_panel()[:,['A','B'],:].to_frame()
1000 loops, best of 3: 1.16 ms per loop
Another Update
If you sort the index of the example foo, many of the times above improve (times have been updated to reflect a pre-sorted index). However, when the index is sorted, you can use the solution described by user674155:
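Assuming the solution meant here is the .loc selection on a sorted index shown in the earlier answer, applying it to foo might look like the sketch below. pd.IndexSlice is used only as shorthand for the slice(None) tuple, and the name selected is illustrative.

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame(
    np.random.rand(12, 3),
    index=pd.MultiIndex.from_product(
        [['A', 'B', 'C', 'D'], ['Green', 'Red', 'Blue']],
        names=['Letter', 'Color'],
    ),
    columns=['X', 'Y', 'Z'],
).sort_index()  # the selection below expects a sorted index

# idx[:, [...]] is shorthand for (slice(None), [...]).
idx = pd.IndexSlice
selected = foo.loc[idx[:, ['Green', 'Red']], :]
print(selected)
```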
This is the most efficient and intuitive in my opinion (the user doesn't need to understand panels and how they are created from frames).
Note: even if the index has not yet been sorted, sorting the index of foo on the fly is comparable in performance to the to_panel option.