Select a multiple-key cross section from a DataFrame

青春惊慌失措 2020-12-05 18:38

I have a DataFrame "df" with a (time, ticker) MultiIndex and bid/ask/etc. data columns:


                          tod    last     bid      ask      volume
    tim         


        
4 Answers
  • 2020-12-05 18:55

    Convert to a Panel; indexing is then direct:

    In [20]: df = pd.DataFrame(dict(time = pd.Timestamp('20130102'), 
                                    A = np.random.rand(3), 
                     ticker=['SPY','SLV','GLD'])).set_index(['time','ticker'])
    
    In [21]: df
    Out[21]: 
                              A
    time       ticker          
    2013-01-02 SPY     0.347209
               SLV     0.034832
               GLD     0.280951
    
    In [22]: p = df.to_panel()
    
    In [23]: p
    Out[23]: 
    <class 'pandas.core.panel.Panel'>
    Dimensions: 1 (items) x 1 (major_axis) x 3 (minor_axis)
    Items axis: A to A
    Major_axis axis: 2013-01-02 00:00:00 to 2013-01-02 00:00:00
    Minor_axis axis: GLD to SPY
    
    In [24]: p.ix[:,:,['SPY','GLD']]
    Out[24]: 
    <class 'pandas.core.panel.Panel'>
    Dimensions: 1 (items) x 1 (major_axis) x 2 (minor_axis)
    Items axis: A to A
    Major_axis axis: 2013-01-02 00:00:00 to 2013-01-02 00:00:00
    Minor_axis axis: SPY to GLD
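
Panel and `.ix` have since been removed from pandas, so the session above will not run on current releases. A minimal sketch of a similar idea, assuming a recent pandas: pivot the ticker level into the columns with `unstack`, then pick tickers as ordinary columns. The names `wide` and `subset` are illustrative only.

```python
import numpy as np
import pandas as pd

# Same shape of data as the example above: one timestamp, three tickers.
df = pd.DataFrame(
    {
        "time": pd.Timestamp("2013-01-02"),
        "ticker": ["SPY", "SLV", "GLD"],
        "A": np.random.rand(3),
    }
).set_index(["time", "ticker"])

# Move the ticker level into the columns (the role the Panel's minor axis played).
wide = df.unstack("ticker")

# Pick the data column, then select tickers as plain columns.
subset = wide["A"][["SPY", "GLD"]]
print(subset)
```

The result stays in wide form (one row per timestamp, one column per ticker), which is the closest analogue to the Panel slice shown above.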
    
  • 2020-12-05 19:06

    I couldn't find a more direct way other than using select:

    >>> df
    
           last   tod
    A SPY     1  1600
      SLV     2  1600
      GLD     3  1600
    
    >>> df.select(lambda x: x[1] in ['SPY','GLD'])
    
           last   tod
    A SPY     1  1600
      GLD     3  1600
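
`DataFrame.select` has since been removed from pandas. A rough equivalent, assuming a recent pandas, is to build a boolean mask from the index level and filter with it. The sample frame below is reconstructed from the output shown above; the name `mask` is illustrative.

```python
import pandas as pd

# Rebuild the small example frame: outer level "A", inner level the ticker.
index = pd.MultiIndex.from_product([["A"], ["SPY", "SLV", "GLD"]])
df = pd.DataFrame({"last": [1, 2, 3], "tod": [1600, 1600, 1600]}, index=index)

# True for rows whose second index level is one of the wanted tickers.
mask = df.index.get_level_values(1).isin(["SPY", "GLD"])

result = df[mask]
print(result)
```

Unlike the lambda passed to `select`, the mask is computed in one vectorized call rather than once per row.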
    
  • 2020-12-05 19:11

    There are better ways of doing this with more recent versions of pandas (see "Multi-indexing using slicers" in the changelog for version 0.14):

    df.loc[(slice(None), ['SPY', 'GLD']), :]
    

    This can be made more readable with the use of pd.IndexSlice:

    df.loc[pd.IndexSlice[:, ['SPY', 'GLD']], :]
    

    With the convention idx = pd.IndexSlice, this becomes

    df.loc[idx[:, ['SPY', 'GLD']], :]
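
For completeness, here is a self-contained sketch of the slicer approach on a (time, ticker) frame. The data and the column names `bid` and `ask` are made up for illustration; the index is sorted first, which the slicer documentation recommends.

```python
import numpy as np
import pandas as pd

times = pd.to_datetime(["2013-01-02", "2013-01-03"])
index = pd.MultiIndex.from_product(
    [times, ["GLD", "SLV", "SPY"]], names=["time", "ticker"]
)
df = pd.DataFrame(
    {"bid": np.arange(6.0), "ask": np.arange(6.0) + 0.5}, index=index
).sort_index()

idx = pd.IndexSlice

# All timestamps, two tickers, every column.
both = df.loc[idx[:, ["SPY", "GLD"]], :]

# One timestamp, two tickers, a single column.
one_day = df.loc[idx[pd.Timestamp("2013-01-02"), ["SPY", "GLD"]], ["bid"]]

print(both)
print(one_day)
```

Note that the trailing `, :` (or an explicit column list) matters: it tells `.loc` that the tuple addresses the row index rather than rows and columns.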
    
  • 2020-12-05 19:18

    For what it is worth, I did the following:

    foo = pd.DataFrame(np.random.rand(12,3), 
                       index=pd.MultiIndex.from_product([['A','B','C','D'],['Green','Red','Blue']], 
                                                        names=['Letter','Color']),
                       columns=['X','Y','Z']).sort_index()
    
    foo.reset_index()\
       .loc[foo.reset_index().Color.isin({'Green','Red'})]\
       .set_index(foo.index.names)
    

    This approach is similar to select, but avoids iterating over all rows with a lambda.
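
As a sketch of the same idea without the `reset_index`/`set_index` round trip, `MultiIndex.isin` accepts a `level` argument, so the mask can be built directly on the index. This assumes a reasonably recent pandas; `via_reset` and `via_index` are illustrative names.

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame(
    np.random.rand(12, 3),
    index=pd.MultiIndex.from_product(
        [["A", "B", "C", "D"], ["Green", "Red", "Blue"]],
        names=["Letter", "Color"],
    ),
    columns=["X", "Y", "Z"],
).sort_index()

# The round-trip approach from this answer, with reset_index called only once.
via_reset = (
    foo.reset_index()
    .loc[lambda d: d["Color"].isin(["Green", "Red"])]
    .set_index(foo.index.names)
)

# Same selection, masking on the index level directly.
via_index = foo[foo.index.isin(["Green", "Red"], level="Color")]

print(via_index)
```

Both variants keep the original row order and level names.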

    However, I compared this to the Panel approach, and the Panel solution appears to be faster (2.91 ms for index/loc vs 1.18 ms for to_panel/to_frame in the timings below):

    foo.to_panel()[:,:,['Green','Red']].to_frame()
    

    Times:

    In [56]:
    %%timeit
    foo.reset_index().loc[foo.reset_index().Color.isin({'Green','Red'})].set_index(foo.index.names)
    100 loops, best of 3: 2.91 ms per loop
    
    In [57]:
    %%timeit
    foo2 = foo.reset_index()
    foo2.loc[foo2.Color.eq('Green') | foo2.Color.eq('Red')].set_index(foo.index.names)
    100 loops, best of 3: 2.85 ms per loop
    
    In [58]:
    %%timeit
    foo2 = foo.reset_index()
    foo2.loc[foo2.Color.ne('Blue')].set_index(foo.index.names)
    100 loops, best of 3: 2.37 ms per loop
    
    In [54]:
    %%timeit
    foo.to_panel()[:,:,['Green','Red']].to_frame()
    1000 loops, best of 3: 1.18 ms per loop
    

    UPDATE

    After revisiting this topic (again), I observed the following:

    In [100]:
    %%timeit
    foo2 = pd.DataFrame({k: foo.loc[k] for k in foo.index if k[1] in ['Green','Red']}).transpose()
    foo2.index.names = foo.index.names
    foo2.columns.names = foo.columns.names
    100 loops, best of 3: 1.97 ms per loop
    
    In [101]:
    %%timeit
    foo2 = pd.DataFrame.from_dict({k: foo.loc[k] for k in foo.index if k[1] in ['Green','Red']}, orient='index')
    foo2.index.names = foo.index.names
    foo2.columns.names = foo.columns.names
    100 loops, best of 3: 1.82 ms per loop
    

    If you don't care about preserving the original order and naming of the levels, you can use:

    %%timeit
    pd.concat({key: foo.xs(key, axis=0, level=1) for key in ['Green','Red']}, axis=0)
    1000 loops, best of 3: 1.31 ms per loop
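
To make the caveat concrete, the sketch below shows what the `concat`/`xs` result looks like and one way to restore the original level order and names afterwards. The extra `names=`, `swaplevel` and `sort_index` steps are my additions, not part of the timed snippet, and they add some cost.

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame(
    np.random.rand(12, 3),
    index=pd.MultiIndex.from_product(
        [["A", "B", "C", "D"], ["Green", "Red", "Blue"]],
        names=["Letter", "Color"],
    ),
    columns=["X", "Y", "Z"],
).sort_index()

# As timed above: the color becomes the outer, unnamed level.
rough = pd.concat({key: foo.xs(key, level="Color") for key in ["Green", "Red"]})
print(rough.index.names)

# Name the new outer level, then put the levels back in (Letter, Color) order.
restored = (
    pd.concat(
        {key: foo.xs(key, level="Color") for key in ["Green", "Red"]},
        names=["Color", "Letter"],
    )
    .swaplevel()
    .sort_index()
)
print(restored.index.names)
```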
    

    And if you are just selecting on the first level:

    %%timeit
    pd.concat({key: foo.loc[key] for key in ['A','B']}, axis=0, names=foo.index.names)
    1000 loops, best of 3: 1.12 ms per loop
    

    versus:

    %%timeit
    foo.to_panel()[:,['A','B'],:].to_frame()
    1000 loops, best of 3: 1.16 ms per loop
    

    Another Update

    If you sort the index of the example foo, many of the times above improve (times have been updated to reflect a pre-sorted index). However, when the index is sorted, you can use the solution described by user674155:

    %%timeit
    foo.loc[(slice(None), ['Blue','Red']),:]
    1000 loops, best of 3: 582 µs per loop
    

    In my opinion this is the most efficient and intuitive option (the user doesn't need to understand panels and how they are created from frames).

    Note: even if the index has not yet been sorted, sorting the index of foo on the fly is comparable in performance to the to_panel option.
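
A minimal sketch of sorting on the fly before slicing. The helper `select_colors` is hypothetical, written only for this illustration, and it sorts only when the index is not already ordered.

```python
import numpy as np
import pandas as pd

# A frame whose index is deliberately not sorted.
unsorted = pd.DataFrame(
    np.random.rand(12, 3),
    index=pd.MultiIndex.from_product(
        [["D", "B", "A", "C"], ["Green", "Red", "Blue"]],
        names=["Letter", "Color"],
    ),
    columns=["X", "Y", "Z"],
)


def select_colors(frame, colors):
    """Sort the index if needed, then slice on the second index level."""
    if not frame.index.is_monotonic_increasing:
        frame = frame.sort_index()
    return frame.loc[(slice(None), list(colors)), :]


picked = select_colors(unsorted, ["Blue", "Red"])
print(picked)
```

If the same frame is sliced repeatedly, sorting it once up front and keeping the sorted copy avoids paying the sort cost on every call.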
