Selecting columns from pandas MultiIndex

隐瞒了意图╮ 2020-11-29 22:36

I have a DataFrame with MultiIndex columns that looks like this:

# sample data
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', …


        
7 Answers
  • 2020-11-29 22:54

    You can use either loc or ix (ix is deprecated in newer pandas versions). I'll show an example with loc:

    data.loc[:, [('one', 'a'), ('one', 'c'), ('two', 'a'), ('two', 'c')]]
    

    When you have a MultiIndexed DataFrame and you want to select only some of the columns, you have to pass a list of tuples that match those columns. So the itertools approach was pretty much OK, but you don't have to create a new MultiIndex:

    data.loc[:, list(itertools.product(['one', 'two'], ['a', 'c']))]
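
    As a self-contained sketch of the same idea: the frame below is a stand-in for the question's sample data (np.arange is used instead of random numbers so the result is reproducible), and the names wanted and subset are only illustrative.

```python
import itertools

import numpy as np
import pandas as pd

# Stand-in for the question's frame: two column levels, 4 rows.
col = pd.MultiIndex.from_arrays(
    [["one", "one", "one", "two", "two", "two"],
     ["a", "b", "c", "a", "b", "c"]]
)
data = pd.DataFrame(np.arange(24).reshape(4, 6), columns=col)

# Every (level 0, level 1) pair to keep, as a plain list of tuples.
wanted = list(itertools.product(["one", "two"], ["a", "c"]))

# .loc accepts the list of tuples directly; no new MultiIndex is needed.
subset = data.loc[:, wanted]
print(subset)
```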
    
  • 2020-11-29 22:54

    A slightly easier, to my mind, riff on Marc P.'s answer using slice:

    import numpy as np
    import pandas as pd
    col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'], ['a', 'b', 'c', 'a', 'b', 'c']])
    data = pd.DataFrame(np.random.randn(4, 6), columns=col)
    
    data.loc[:, pd.IndexSlice[:, ['a', 'c']]]
    
            one                 two          
              a         c         a         c
    0 -1.731008  0.718260 -1.088025 -1.489936
    1 -0.681189  1.055909  1.825839  0.149438
    2 -1.674623  0.769062  1.857317  0.756074
    3  0.408313  1.291998  0.833145 -0.471879
    

    As of pandas 0.21 or so, .select is deprecated in favour of .loc.

  • 2020-11-29 23:01

    It's not great, but maybe:

    >>> data
            one                           two                    
              a         b         c         a         b         c
    0 -0.927134 -1.204302  0.711426  0.854065 -0.608661  1.140052
    1 -0.690745  0.517359 -0.631856  0.178464 -0.312543 -0.418541
    2  1.086432  0.194193  0.808235 -0.418109  1.055057  1.886883
    3 -0.373822 -0.012812  1.329105  1.774723 -2.229428 -0.617690
    >>> data.loc[:,data.columns.get_level_values(1).isin({"a", "c"})]
            one                 two          
              a         c         a         c
    0 -0.927134  0.711426  0.854065  1.140052
    1 -0.690745 -0.631856  0.178464 -0.418541
    2  1.086432  0.808235 -0.418109  1.886883
    3 -0.373822  1.329105  1.774723 -0.617690
    

    would work?

  • 2020-11-29 23:12

    The most straightforward way is with .loc:

    >>> data.loc[:, (['one', 'two'], ['a', 'b'])]
    
    
       one       two     
         a    b    a    b
    0  0.4 -0.6 -0.7  0.9
    1  0.1  0.4  0.5 -0.3
    2  0.7 -1.6  0.7 -0.8
    3 -0.9  2.6  1.9  0.6
    

    Remember that [] and () have special meaning when dealing with a MultiIndex object:

    (...) a tuple is interpreted as one multi-level key

    (...) a list is used to specify several keys [on the same level]

    (...) a tuple of lists refer to several values within a level

    When we write (['one', 'two'], ['a', 'b']), the first list inside the tuple specifies all the values we want from the 1st level of the MultiIndex. The second list inside the tuple specifies all the values we want from the 2nd level of the MultiIndex.
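
    To make the three cases concrete, here is a small sketch. The frame is a reproducible stand-in for the sample data, and the variable names are just for illustration.

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays(
    [["one", "one", "one", "two", "two", "two"],
     ["a", "b", "c", "a", "b", "c"]]
)
data = pd.DataFrame(np.arange(24).reshape(4, 6), columns=col)

# A tuple is one multi-level key: it names a single column (a Series).
single = data.loc[:, ("one", "a")]

# A list of tuples names several complete keys.
several = data.loc[:, [("one", "a"), ("two", "c")]]

# A tuple of lists gives the wanted values level by level.
per_level = data.loc[:, (["one", "two"], ["a", "b"])]

print(type(single).__name__)
print(several.columns.tolist())
print(per_level.columns.tolist())
```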

    Edit 1: Another possibility is to use slice(None) to specify that we want anything from the first level (works similarly to slicing with : in lists). And then specify which columns from the second level we want.

    >>> data.loc[:, (slice(None), ["a", "b"])]
    
       one       two     
         a    b    a    b
    0  0.4 -0.6 -0.7  0.9
    1  0.1  0.4  0.5 -0.3
    2  0.7 -1.6  0.7 -0.8
    3 -0.9  2.6  1.9  0.6
    

    If the syntax slice(None) does not appeal to you, then another possibility is to use pd.IndexSlice, which helps with slicing frames that have more elaborate indices.

    >>> data.loc[:, pd.IndexSlice[:, ["a", "b"]]]
    
       one       two     
         a    b    a    b
    0  0.4 -0.6 -0.7  0.9
    1  0.1  0.4  0.5 -0.3
    2  0.7 -1.6  0.7 -0.8
    3 -0.9  2.6  1.9  0.6
    

    When using pd.IndexSlice, we can use : as usual to slice the frame.
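
    A quick way to convince yourself that the two spellings are interchangeable is the sketch below (stand-in data again, illustrative variable names). pd.IndexSlice just hands back the key you wrote inside the brackets, so the : becomes slice(None).

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays(
    [["one", "one", "one", "two", "two", "two"],
     ["a", "b", "c", "a", "b", "c"]]
)
data = pd.DataFrame(np.arange(24).reshape(4, 6), columns=col)

idx = pd.IndexSlice

# The key IndexSlice builds is an ordinary tuple containing slice(None).
key = idx[:, ["a", "b"]]
print(key)

via_slice = data.loc[:, (slice(None), ["a", "b"])]
via_indexslice = data.loc[:, key]

# Both selections return the same frame.
print(via_slice.equals(via_indexslice))
```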

    Source: MultiIndex / Advanced Indexing, How to use slice(None)

  • 2020-11-29 23:16

    ix and select are deprecated!

    Combined with pd.IndexSlice, loc is a better option than ix or select.


    DataFrame.loc with pd.IndexSlice

    # Setup
    col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                    ['a', 'b', 'c', 'a', 'b', 'c']])
    data = pd.DataFrame('x', index=range(4), columns=col)
    data
    
      one       two      
        a  b  c   a  b  c
    0   x  x  x   x  x  x
    1   x  x  x   x  x  x
    2   x  x  x   x  x  x
    3   x  x  x   x  x  x
    

    data.loc[:, pd.IndexSlice[:, ['a', 'c']]]
    
      one    two   
        a  c   a  c
    0   x  x   x  x
    1   x  x   x  x
    2   x  x   x  x
    3   x  x   x  x
    

    You can alternatively pass an axis parameter to loc to make it explicit which axis you're indexing from:

    data.loc(axis=1)[pd.IndexSlice[:, ['a', 'c']]]
    
      one    two   
        a  c   a  c
    0   x  x   x  x
    1   x  x   x  x
    2   x  x   x  x
    3   x  x   x  x
    

    MultiIndex.get_level_values

    Calling data.columns.get_level_values to filter with loc is another option:

    data.loc[:, data.columns.get_level_values(1).isin(['a', 'c'])]
    
      one    two   
        a  c   a  c
    0   x  x   x  x
    1   x  x   x  x
    2   x  x   x  x
    3   x  x   x  x
    

    This can naturally allow for filtering on any conditional expression on a single level. Here's a random example with lexicographical filtering:

    data.loc[:, data.columns.get_level_values(1) > 'b']
    
      one two
        c   c
    0   x   x
    1   x   x
    2   x   x
    3   x   x
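
    Since the mask is just a boolean array, masks built from different levels can also be combined. The sketch below goes one step beyond the single-level example above. The names level0, level1 and mask are illustrative, and the frame repeats the setup from this answer.

```python
import pandas as pd

col = pd.MultiIndex.from_arrays(
    [["one", "one", "one", "two", "two", "two"],
     ["a", "b", "c", "a", "b", "c"]]
)
data = pd.DataFrame("x", index=range(4), columns=col)

level0 = data.columns.get_level_values(0)
level1 = data.columns.get_level_values(1)

# One boolean per column: top level is 'two' AND second level sorts after 'a'.
mask = (level0 == "two") & (level1 > "a")

subset = data.loc[:, mask]
print(subset.columns.tolist())
```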
    

    More information on slicing and filtering MultiIndexes can be found at Select rows in pandas MultiIndex DataFrame.

  • 2020-11-29 23:19

    I think there is a much better way now, which is why I'm pulling this question (which was the top Google result) out of the shadows:

    data.select(lambda x: x[1] in ['a', 'b'], axis=1)
    

    gives your expected output in a quick and clean one-liner:

            one                 two          
              a         b         a         b
    0 -0.341326  0.374504  0.534559  0.429019
    1  0.272518  0.116542 -0.085850 -0.330562
    2  1.982431 -0.420668 -0.444052  1.049747
    3  0.162984 -0.898307  1.762208 -0.101360
    

    It is mostly self-explanatory: the [1] refers to the level.
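
    On recent pandas versions, where select is no longer available, the same predicate can be applied by hand. This is only a sketch of one possible replacement, not the select API itself. The frame is a reproducible stand-in and keep is an illustrative name.

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays(
    [["one", "one", "one", "two", "two", "two"],
     ["a", "b", "c", "a", "b", "c"]]
)
data = pd.DataFrame(np.arange(24).reshape(4, 6), columns=col)

# Iterating over MultiIndex columns yields tuples, so c[1] is the
# second-level label, just like x[1] in the lambda above.
keep = [c for c in data.columns if c[1] in ("a", "b")]

subset = data.loc[:, keep]
print(subset.columns.tolist())
```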
