Select a multiple-key cross section from a DataFrame

Anonymous (unverified), submitted 2019-12-03 01:06:02

Question:

I have a DataFrame "df" with a (time, ticker) MultiIndex and bid/ask/etc. data columns:

                          tod    last     bid      ask      volume
    time        ticker
    2013-02-01  SPY       1600   149.70   150.14   150.17   1300
                SLV       1600   30.44    30.38    30.43    3892
                GLD       1600   161.20   161.19   161.21   3860

I would like to select a second-level (level=1) cross section using multiple keys. Right now, I can do it using one key, i.e.

      df.xs('SPY', level=1)  

which gives me a timeseries of SPY. What is the best way to select a multi-key cross section, i.e. a combined cross-section of both SPY and GLD, something like:

      df.xs(['SPY', 'GLD'], level=1)  

?

Answer 1:

Convert to a panel, then indexing is direct

    In [20]: df = pd.DataFrame(dict(time = pd.Timestamp('20130102'),
                                    A = np.random.rand(3),
                                    ticker=['SPY','SLV','GLD'])).set_index(['time','ticker'])

    In [21]: df
    Out[21]:
                              A
    time       ticker
    2013-01-02 SPY     0.347209
               SLV     0.034832
               GLD     0.280951

    In [22]: p = df.to_panel()

    In [23]: p
    Out[23]:
    Dimensions: 1 (items) x 1 (major_axis) x 3 (minor_axis)
    Items axis: A to A
    Major_axis axis: 2013-01-02 00:00:00 to 2013-01-02 00:00:00
    Minor_axis axis: GLD to SPY

    In [24]: p.ix[:,:,['SPY','GLD']]
    Out[24]:
    Dimensions: 1 (items) x 1 (major_axis) x 2 (minor_axis)
    Items axis: A to A
    Major_axis axis: 2013-01-02 00:00:00 to 2013-01-02 00:00:00
    Minor_axis axis: SPY to GLD


Answer 2:

I couldn't find a more direct way other than using select:

    >>> df
            last   tod
    A SPY     1  1600
      SLV     2  1600
      GLD     3  1600

    >>> df.select(lambda x: x[1] in ['SPY','GLD'])
            last   tod
    A SPY     1  1600
      GLD     3  1600
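`DataFrame.select` was deprecated in pandas 0.21 and removed in 1.0, so the call above fails on current versions. A minimal sketch of the same filter, assuming a recent pandas, builds a boolean mask from the second index level instead of calling a lambda per row. The frame is toy data mirroring the example above; `mask` and `selected` are illustrative names.

```python
import pandas as pd

# Toy frame shaped like the example above (values are illustrative).
index = pd.MultiIndex.from_product([["A"], ["SPY", "SLV", "GLD"]])
df = pd.DataFrame({"last": [1, 2, 3], "tod": [1600, 1600, 1600]}, index=index)

# Vectorized membership test on the second index level (level=1).
mask = df.index.get_level_values(1).isin(["SPY", "GLD"])

# Boolean indexing keeps the matching rows in their original order.
selected = df[mask]
print(selected)
```

Unlike the `.loc` slicing approach, a boolean mask does not depend on the index being sorted.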


Answer 3:

There are better ways of doing this with more recent versions of Pandas:

    df.loc[(slice(None), ['SPY', 'GLD']), :]

This approach requires that the index be lexicographically sorted (use df.sort_index()).
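To make this concrete, here is a small runnable sketch on a toy frame named `df` (the data is made up, not the asker's). It sorts the index first, then selects both tickers on the second level, once with an explicit tuple of slicers and once with `pd.IndexSlice`, which is an equivalent and often more readable spelling.

```python
import numpy as np
import pandas as pd

# Toy data shaped like the question's frame (values are made up).
index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2013-02-01", "2013-02-04"]), ["SPY", "SLV", "GLD"]],
    names=["time", "ticker"],
)
df = pd.DataFrame({"last": np.arange(6.0), "volume": np.arange(6) * 100}, index=index)

# Sort the index before slicing on an inner level.
df = df.sort_index()

# slice(None) means "everything" on the first level (time);
# the list picks the wanted keys on the second level (ticker).
by_tuple = df.loc[(slice(None), ["SPY", "GLD"]), :]

# The same selection written with pd.IndexSlice.
idx = pd.IndexSlice
by_indexslice = df.loc[idx[:, ["SPY", "GLD"]], :]

print(by_tuple)
```

The trailing `, :` selects all columns; it also makes it unambiguous that the tuple indexes rows rather than rows and columns.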



Answer 4:

For what it is worth, I did the following:

    foo = pd.DataFrame(np.random.rand(12,3),
                       index=pd.MultiIndex.from_product([['A','B','C','D'],['Green','Red','Blue']],
                                                        names=['Letter','Color']),
                       columns=['X','Y','Z']).sort_index()

    foo.reset_index()\
       .loc[foo.reset_index().Color.isin({'Green','Red'})]\
       .set_index(foo.index.names)

This approach is similar to select, but avoids iterating over all rows with a lambda.

However, I compared this to the Panel approach, and it appears the Panel solution is faster (2.91 ms for index/loc vs 1.48 ms for to_panel/to_frame):

    foo.to_panel()[:,:,['Green','Red']].to_frame()

Times:

    In [56]: %%timeit
    foo.reset_index().loc[foo.reset_index().Color.isin({'Green','Red'})].set_index(foo.index.names)
    100 loops, best of 3: 2.91 ms per loop

    In [57]: %%timeit
    foo2 = foo.reset_index()
    foo2.loc[foo2.Color.eq('Green') | foo2.Color.eq('Red')].set_index(foo.index.names)
    100 loops, best of 3: 2.85 ms per loop

    In [58]: %%timeit
    foo2 = foo.reset_index()
    foo2.loc[foo2.Color.ne('Blue')].set_index(foo.index.names)
    100 loops, best of 3: 2.37 ms per loop

    In [54]: %%timeit
    foo.to_panel()[:,:,['Green','Red']].to_frame()
    1000 loops, best of 3: 1.18 ms per loop

UPDATE

After revisiting this topic (again), I observed the following:

    In [100]: %%timeit
    foo2 = pd.DataFrame({k: foo.loc[k] for k in foo.index if k[1] in ['Green','Red']}).transpose()
    foo2.index.names = foo.index.names
    foo2.columns.names = foo.columns.names
    100 loops, best of 3: 1.97 ms per loop

    In [101]: %%timeit
    foo2 = pd.DataFrame.from_dict({k: foo.loc[k] for k in foo.index if k[1] in ['Green','Red']}, orient='index')
    foo2.index.names = foo.index.names
    foo2.columns.names = foo.columns.names
    100 loops, best of 3: 1.82 ms per loop

If you don't care about preserving the original order and naming of the levels, you can use:

    %%timeit
    pd.concat({key: foo.xs(key, axis=0, level=1) for key in ['Green','Red']}, axis=0)
    1000 loops, best of 3: 1.31 ms per loop
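The caveat about level order and naming is easier to see on a small example. The sketch below rebuilds a frame like `foo` with deterministic values (an assumption, the original uses random data), runs the `concat`/`xs` combination, and then shows one possible way to restore the original layout. The names `combined`, `restored` and `expected` are illustrative.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for foo (the original uses random values).
index = pd.MultiIndex.from_product(
    [["A", "B", "C", "D"], ["Green", "Red", "Blue"]], names=["Letter", "Color"]
)
foo = pd.DataFrame(
    np.arange(36).reshape(12, 3), index=index, columns=["X", "Y", "Z"]
).sort_index()

# xs drops the Color level; concat then prepends the dict keys as a new,
# unnamed outer level, so the result is indexed by (color, Letter).
combined = pd.concat(
    {key: foo.xs(key, axis=0, level=1) for key in ["Green", "Red"]}, axis=0
)
print(list(combined.index.names))  # [None, 'Letter']

# One way to get back to the (Letter, Color) layout.
combined.index.names = ["Color", "Letter"]
restored = combined.swaplevel().sort_index()

# Reference result from slicing the sorted index directly.
expected = foo.loc[(slice(None), ["Green", "Red"]), :]
print(restored.index.tolist() == expected.index.tolist())
```

The extra renaming and reordering steps add cost of their own, so the timing above only applies when the swapped layout is acceptable.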

And if you are just selecting on the first level:

    %%timeit
    pd.concat({key: foo.loc[key] for key in ['A','B']}, axis=0, names=foo.index.names)
    1000 loops, best of 3: 1.12 ms per loop

versus:

    %%timeit
    foo.to_panel()[:,['A','B'],:].to_frame()
    1000 loops, best of 3: 1.16 ms per loop

Another Update

If you sort the index of the example foo, many of the times above improve (times have been updated to reflect a pre-sorted index). However, when the index is sorted, you can use the solution described by user674155 (the .loc approach with slice(None) from Answer 3, which requires a sorted index).

This is the most efficient and intuitive in my opinion (the user doesn't need to understand panels and how they are created from frames).

Note: even if the index has not yet been sorted, sorting the index of foo on the fly is comparable in performance to the to_panel option.
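As a rough illustration of sorting on the fly, the sketch below starts from a deliberately unsorted version of `foo` (deterministic values, an assumption made for reproducibility) and chains `sort_index()` with the `.loc` selection. No timings are claimed here; they would need to be measured on the data at hand.

```python
import numpy as np
import pandas as pd

# Unsorted stand-in for foo: the Letter level is given out of order.
index = pd.MultiIndex.from_product(
    [["D", "B", "A", "C"], ["Green", "Red", "Blue"]], names=["Letter", "Color"]
)
foo = pd.DataFrame(np.arange(36).reshape(12, 3), index=index, columns=["X", "Y", "Z"])

# The index is not in sorted order yet.
print(foo.index.is_monotonic_increasing)  # False

# Sort on the fly, then select two keys on the second level.
selected = foo.sort_index().loc[(slice(None), ["Green", "Red"]), :]
print(selected)
```

If the same frame is queried repeatedly, sorting once and keeping the sorted frame avoids paying for `sort_index()` on every selection.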


