pandas: best way to select all columns whose names start with X

别那么骄傲 2020-11-27 10:03

I have a DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({'foo.aa': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'foo.fighters': [0, 1, np.nan, 0, 0, 0],
                   'foo.bars': [0, 0, 0, 0, 0, 1],
                   'bar.baz': [5, 5, 6, 5, 5.6, 6.8],
                   'foo.fox': [2, 4, 1, 0, 0, 5],
                   'nas.foo': ['NA', 0, 1, 0, 0, 0],
                   'foo.manchu': ['NA', 0, 0, 0, 0, 0]})

I want to select all columns whose names start with foo, and then keep only the rows in which one of those columns has the value 1.

8 answers
  • 2020-11-27 10:22

    Just perform a list comprehension to create your columns:

    In [28]:
    
    filter_col = [col for col in df if col.startswith('foo')]
    filter_col
    Out[28]:
    ['foo.aa', 'foo.bars', 'foo.fighters', 'foo.fox', 'foo.manchu']
    In [29]:
    
    df[filter_col]
    Out[29]:
       foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
    0     1.0         0             0        2         NA
    1     2.1         0             1        4          0
    2     NaN         0           NaN        1          0
    3     4.7         0             0        0          0
    4     5.6         0             0        0          0
    5     6.8         1             0        5          0
    

    Another method is to create a series from the columns and use the vectorised str method startswith:

    In [33]:
    
    df[df.columns[pd.Series(df.columns).str.startswith('foo')]]
    Out[33]:
       foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
    0     1.0         0             0        2         NA
    1     2.1         0             1        4          0
    2     NaN         0           NaN        1          0
    3     4.7         0             0        0          0
    4     5.6         0             0        0          0
    5     6.8         1             0        5          0
    

    In order to achieve what you want, you then need to filter out the values that don't meet your == 1 criterion:

    In [36]:
    
    df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]]==1]
    Out[36]:
       bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
    0      NaN       1       NaN           NaN      NaN        NaN     NaN
    1      NaN     NaN       NaN             1      NaN        NaN     NaN
    2      NaN     NaN       NaN           NaN        1        NaN     NaN
    3      NaN     NaN       NaN           NaN      NaN        NaN     NaN
    4      NaN     NaN       NaN           NaN      NaN        NaN     NaN
    5      NaN     NaN         1           NaN      NaN        NaN     NaN
    

    EDIT

    OK, after seeing what you want, the (admittedly convoluted) answer is this:

    In [72]:
    
    df.loc[df[df[df.columns[pd.Series(df.columns).str.startswith('foo')]] == 1].dropna(how='all', axis=0).index]
    Out[72]:
       bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
    0      5.0     1.0         0             0        2         NA      NA
    1      5.0     2.1         0             1        4          0       0
    2      6.0     NaN         0           NaN        1          0       1
    5      6.8     6.8         1             0        5          0       0
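
    For readers who find that one-liner hard to parse, the same rows can be recovered with the pieces unrolled. A minimal sketch, assuming the DataFrame from the question (values reconstructed from the output tables above):

```python
import pandas as pd
import numpy as np

# Reconstruction of the question's DataFrame (values taken from the
# output tables shown in the answers).
df = pd.DataFrame({'foo.aa': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'foo.fighters': [0, 1, np.nan, 0, 0, 0],
                   'foo.bars': [0, 0, 0, 0, 0, 1],
                   'bar.baz': [5, 5, 6, 5, 5.6, 6.8],
                   'foo.fox': [2, 4, 1, 0, 0, 5],
                   'nas.foo': ['NA', 0, 1, 0, 0, 0],
                   'foo.manchu': ['NA', 0, 0, 0, 0, 0]})

# Columns whose names start with 'foo'.
foo_cols = [c for c in df.columns if c.startswith('foo')]

# Rows in which at least one of those columns equals 1.
keep = (df[foo_cols] == 1).any(axis=1)

# All columns, but only the matching rows (0, 1, 2 and 5).
result = df.loc[keep]
```

    The keep mask is True exactly for rows 0, 1, 2 and 5, matching Out[72].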
    
  • 2020-11-27 10:26

    Another option for the selection of the desired entries is to use map:

    df.loc[(df == 1).any(axis=1), df.columns.map(lambda x: x.startswith('foo'))]
    

    which gives you all the columns for rows that contain a 1:

       foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
    0     1.0         0             0        2         NA
    1     2.1         0             1        4          0
    2     NaN         0           NaN        1          0
    5     6.8         1             0        5          0
    

    The row selection is done by

    (df == 1).any(axis=1)
    

    as in @ajcr's answer which gives you:

    0     True
    1     True
    2     True
    3    False
    4    False
    5     True
    dtype: bool
    

    meaning that row 3 and 4 do not contain a 1 and won't be selected.

    The selection of the columns is done using Boolean indexing like this:

    df.columns.map(lambda x: x.startswith('foo'))
    

    In the example above this returns

    array([False,  True,  True,  True,  True,  True, False], dtype=bool)
    

    So, if a column does not start with foo, False is returned and the column is therefore not selected.

    If you just want to return all rows that contain a 1 - as your desired output suggests - you can simply do

    df.loc[(df == 1).any(axis=1)]
    

    which returns

       bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
    0      5.0     1.0         0             0        2         NA      NA
    1      5.0     2.1         0             1        4          0       0
    2      6.0     NaN         0           NaN        1          0       1
    5      6.8     6.8         1             0        5          0       0
    
  • 2020-11-27 10:31

    Now that pandas' indexes support string operations, arguably the simplest and best way to select columns beginning with 'foo' is just:

    df.loc[:, df.columns.str.startswith('foo')]
    

    Alternatively, you can filter column (or row) labels with df.filter(). To match the names beginning with foo., specify a regular expression:

    >>> df.filter(regex=r'^foo\.', axis=1)
       foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
    0     1.0         0             0        2         NA
    1     2.1         0             1        4          0
    2     NaN         0           NaN        1          0
    3     4.7         0             0        0          0
    4     5.6         0             0        0          0
    5     6.8         1             0        5          0
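
    As an aside, df.filter() also accepts a like parameter for plain substring matching. Unlike the regex anchored with ^, this matches foo anywhere in the label, so on the question's data it also picks up nas.foo:

```python
import pandas as pd
import numpy as np

# The question's DataFrame, reconstructed from the answers' output tables.
df = pd.DataFrame({'foo.aa': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'foo.fighters': [0, 1, np.nan, 0, 0, 0],
                   'foo.bars': [0, 0, 0, 0, 0, 1],
                   'bar.baz': [5, 5, 6, 5, 5.6, 6.8],
                   'foo.fox': [2, 4, 1, 0, 0, 5],
                   'nas.foo': ['NA', 0, 1, 0, 0, 0],
                   'foo.manchu': ['NA', 0, 0, 0, 0, 0]})

# Substring match: every label containing 'foo', wherever it appears.
cols = df.filter(like='foo', axis=1).columns.tolist()
# 'nas.foo' is included; 'bar.baz' is not.
```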
    

    To select only the required rows (containing a 1) and the columns, you can use loc, selecting the columns using filter (or any other method) and the rows using any:

    >>> df.loc[(df == 1).any(axis=1), df.filter(regex=r'^foo\.', axis=1).columns]
       foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu
    0     1.0         0             0        2         NA
    1     2.1         0             1        4          0
    2     NaN         0           NaN        1          0
    5     6.8         1             0        5          0
    
  • 2020-11-27 10:31

    My solution. It may be slower, performance-wise:

    a = pd.concat(df[df[c] == 1] for c in df.columns if c.startswith('foo'))
    a.sort_index()
    
    
       bar.baz  foo.aa  foo.bars  foo.fighters  foo.fox foo.manchu nas.foo
    0      5.0     1.0         0             0        2         NA      NA
    1      5.0     2.1         0             1        4          0       0
    2      6.0     NaN         0           NaN        1          0       1
    5      6.8     6.8         1             0        5          0       0
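
    One caveat with the concat approach: a row that has a 1 in more than one foo column is emitted once per match, so it can be worth de-duplicating on the index. A sketch, assuming the question's DataFrame:

```python
import pandas as pd
import numpy as np

# The question's DataFrame, reconstructed from the answers' output tables.
df = pd.DataFrame({'foo.aa': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'foo.fighters': [0, 1, np.nan, 0, 0, 0],
                   'foo.bars': [0, 0, 0, 0, 0, 1],
                   'bar.baz': [5, 5, 6, 5, 5.6, 6.8],
                   'foo.fox': [2, 4, 1, 0, 0, 5],
                   'nas.foo': ['NA', 0, 1, 0, 0, 0],
                   'foo.manchu': ['NA', 0, 0, 0, 0, 0]})

# One sub-frame per foo column, stitched together; a row can appear
# once per foo column in which it has a 1.
a = pd.concat(df[df[c] == 1] for c in df.columns if c.startswith('foo'))

# Keep each matching row once and restore the original row order.
a = a[~a.index.duplicated()].sort_index()
```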
    
  • 2020-11-27 10:36

    Based on @EdChum's answer, you can try the following solution:

    df[df.columns[pd.Series(df.columns).str.contains("foo")]]
    

    This is really helpful when not all the columns you want to select start with foo: this method selects every column containing the substring foo, wherever it appears in the column's name.

    In essence, I replaced .startswith() with .contains().
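
    The flip side of that flexibility, shown on the question's data: because foo also occurs in the middle of nas.foo, the .contains() selection is one column larger than the .startswith() one:

```python
import pandas as pd
import numpy as np

# The question's DataFrame, reconstructed from the answers' output tables.
df = pd.DataFrame({'foo.aa': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'foo.fighters': [0, 1, np.nan, 0, 0, 0],
                   'foo.bars': [0, 0, 0, 0, 0, 1],
                   'bar.baz': [5, 5, 6, 5, 5.6, 6.8],
                   'foo.fox': [2, 4, 1, 0, 0, 5],
                   'nas.foo': ['NA', 0, 1, 0, 0, 0],
                   'foo.manchu': ['NA', 0, 0, 0, 0, 0]})

starts = df.columns[pd.Series(df.columns).str.startswith('foo')]
contains = df.columns[pd.Series(df.columns).str.contains('foo')]
# 'nas.foo' shows up only in the contains-based selection.
```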

  • 2020-11-27 10:37

    You can use a regex here to select the columns starting with "foo":

    df.filter(regex='^foo')

    (A pattern like '^foo*' is not quite right: the * applies only to the final o, so it would also match a column named 'f' or 'fo'.) If you just need the string foo somewhere in the column name, then

    df.filter(regex='foo')

    would be appropriate.

    For the next step, apply the mask to the filtered frame rather than to df itself; the two have different shapes, so df[df.filter(regex='^foo').values == 1] raises a ValueError. Instead:

    foo = df.filter(regex='^foo')
    foo[(foo == 1).any(axis=1)]

    This keeps the rows in which one of the foo columns has the value 1.
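
    A quick check of an anchored versus an unanchored pattern against the question's columns; df.filter(regex=...) keeps the labels for which re.search matches, so the unanchored pattern also catches nas.foo:

```python
import pandas as pd
import numpy as np

# The question's DataFrame, reconstructed from the answers' output tables.
df = pd.DataFrame({'foo.aa': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'foo.fighters': [0, 1, np.nan, 0, 0, 0],
                   'foo.bars': [0, 0, 0, 0, 0, 1],
                   'bar.baz': [5, 5, 6, 5, 5.6, 6.8],
                   'foo.fox': [2, 4, 1, 0, 0, 5],
                   'nas.foo': ['NA', 0, 1, 0, 0, 0],
                   'foo.manchu': ['NA', 0, 0, 0, 0, 0]})

# Anchored: only labels that begin with 'foo'.
anchored = df.filter(regex='^foo').columns.tolist()

# Unanchored: 'foo' may appear anywhere, so 'nas.foo' matches too.
unanchored = df.filter(regex='foo').columns.tolist()
```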
