pandas: filter rows of DataFrame with operator chaining

悲哀的现实 2020-11-22 16:46

Most operations in pandas can be accomplished with operator chaining (groupby, aggregate, apply, etc.), but the only way I've found to filter rows is via normal bracket indexing, which requires assigning the DataFrame to a variable before filtering on its values. Is there a chainable way to filter rows?

14 answers
  • 2020-11-22 17:16

    I offer this for additional examples. This is the same answer as https://stackoverflow.com/a/28159296/

    I'll add other edits to make this post more useful.

    pandas.DataFrame.query
    query was made for exactly this purpose. Consider the dataframe df

    import pandas as pd
    import numpy as np
    
    np.random.seed([3,1415])
    df = pd.DataFrame(
        np.random.randint(10, size=(10, 5)),
        columns=list('ABCDE')
    )
    
    df
    
       A  B  C  D  E
    0  0  2  7  3  8
    1  7  0  6  8  6
    2  0  2  0  4  9
    3  7  3  2  4  3
    4  3  6  7  7  4
    5  5  3  7  5  9
    6  8  7  6  4  7
    7  6  2  6  6  5
    8  2  8  7  5  8
    9  4  7  6  1  5
    

    Let's use query to filter all rows where D > B

    df.query('D > B')
    
       A  B  C  D  E
    0  0  2  7  3  8
    1  7  0  6  8  6
    2  0  2  0  4  9
    3  7  3  2  4  3
    4  3  6  7  7  4
    5  5  3  7  5  9
    7  6  2  6  6  5
    

    Which we can chain:

    df.query('D > B').query('C > B')
    # equivalent to
    # df.query('D > B and C > B')
    # but defeats the purpose of demonstrating chaining
    
       A  B  C  D  E
    0  0  2  7  3  8
    1  7  0  6  8  6
    4  3  6  7  7  4
    5  5  3  7  5  9
    7  6  2  6  6  5
    
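    query can also reference Python variables in the calling scope with @, which keeps a chain parameterized without intermediate assignments. A small sketch using the same df (threshold is a name introduced here for illustration):

    threshold = 5
    df.query('D > B').query('E > @threshold')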
  • 2020-11-22 17:17

    Just want to add a demonstration using loc to filter not only by rows but also by columns, along with some merits of the chained operation.

    The code below can filter the rows by value.

    df_filtered = df.loc[df['column'] == value]
    

    By modifying it a bit you can filter the columns as well.

    df_filtered = df.loc[df['column'] == value, ['year', 'column']]
    

    So why do we want a chained method? The answer is that it is simpler to read when you have many operations. For example,

    # note: np.nanmean needs `import numpy as np`
    res = (df
           .loc[df['station'] == 'USA', ['year', 'TEMP', 'RF']]  # include 'year' so groupby can find it
           .groupby('year')
           .agg(np.nanmean))
    
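    Since the snippet above assumes an existing df, here is a self-contained sketch with toy data (the station/year/TEMP/RF columns come from the example; the values are made up):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'station': ['USA', 'USA', 'EU', 'USA'],
        'year':    [2000, 2000, 2001, 2001],
        'TEMP':    [15.2, np.nan, 9.8, 16.1],
        'RF':      [1.1, 2.3, 0.7, np.nan],
    })

    res = (df
           .loc[df['station'] == 'USA', ['year', 'TEMP', 'RF']]
           .groupby('year')
           .agg('mean'))   # pandas' mean skips NaN, like np.nanmean
    print(res)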
  • 2020-11-22 17:17

    If you set your columns to search as indexes, then you can use DataFrame.xs() to take a cross section. This is not as versatile as the query answers, but it might be useful in some situations.

    import pandas as pd
    import numpy as np
    
    np.random.seed([3,1415])
    df = pd.DataFrame(
        np.random.randint(3, size=(10, 5)),
        columns=list('ABCDE')
    )
    
    df
    # Out[55]: 
    #    A  B  C  D  E
    # 0  0  2  2  2  2
    # 1  1  1  2  0  2
    # 2  0  2  0  0  2
    # 3  0  2  2  0  1
    # 4  0  1  1  2  0
    # 5  0  0  0  1  2
    # 6  1  0  1  1  1
    # 7  0  0  2  0  2
    # 8  2  2  2  2  2
    # 9  1  2  0  2  1
    
    df.set_index(['A', 'D']).xs((0, 2), drop_level=False).reset_index()
    # Out[57]: 
    #    A  D  B  C  E
    # 0  0  2  2  2  2
    # 1  0  2  1  1  0
    
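    xs can also cross-section on a single named level if you only want to fix one of the index keys; a small sketch with the same df:

    # all rows where the 'A' level equals 0; the matched level is dropped
    df.set_index(['A', 'D']).xs(0, level='A')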
  • 2020-11-22 17:22

    I had the same question except that I wanted to combine the criteria into an OR condition. The format given by Wouter Overmeire combines the criteria into an AND condition such that both must be satisfied:

    In [96]: df
    Out[96]:
       A  B  C  D
    a  1  4  9  1
    b  4  5  0  2
    c  5  5  1  0
    d  1  3  9  6
    
    In [99]: df[(df.A == 1) & (df.D == 6)]
    Out[99]:
       A  B  C  D
    d  1  3  9  6
    

    But I found that, if you join the criteria with a pipe |, they are combined in an OR condition, satisfied whenever either of them is true (the (... == True) wrapping shown below is redundant; the pipe alone does the work):

    df[((df.A == 1) == True) | ((df.D == 6) == True)]
    # equivalent, without the redundant == True:
    df[(df.A == 1) | (df.D == 6)]
    
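    If you want the OR filter to stay inside a chain, query accepts or directly; a minimal self-contained sketch built from the frame above:

    import pandas as pd

    df = pd.DataFrame({'A': [1, 4, 5, 1],
                       'B': [4, 5, 5, 3],
                       'C': [9, 0, 1, 9],
                       'D': [1, 2, 0, 6]},
                      index=list('abcd'))

    print(df.query('A == 1 or D == 6'))   # same rows as df[(df.A == 1) | (df.D == 6)]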
  • 2020-11-22 17:22

    This solution is more hackish in terms of implementation, but I find it much cleaner in terms of usage, and it is certainly more general than the others proposed.

    https://github.com/toobaz/generic_utils/blob/master/generic_utils/pandas/where.py

    You don't need to download the entire repo: saving the file and doing

    from where import where as W
    

    should suffice. Then you use it like this:

    df = pd.DataFrame([[1, 2, True],
                       [3, 4, False], 
                       [5, 7, True]],
                      index=range(3), columns=['a', 'b', 'c'])
    # On specific column:
    print(df.loc[W['a'] > 2])
    print(df.loc[-W['a'] == W['b']])
    print(df.loc[~W['c']])
    # On entire - or subset of a - DataFrame:
    print(df.loc[W.sum(axis=1) > 3])
    print(df.loc[W[['a', 'b']].diff(axis=1)['b'] > 1])
    

    A slightly less stupid usage example:

    data = pd.read_csv('ugly_db.csv').loc[~(W == '$null$').any(axis=1)]
    

    By the way, even in the case where you are just using boolean columns,

    df.loc[W['cond1']].loc[W['cond2']]
    

    can be much more efficient than

    df.loc[W['cond1'] & W['cond2']]
    

    because it evaluates cond2 only where cond1 is True.
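    Related: plain pandas gets much of the same effect, because .loc itself accepts a callable (since pandas 0.18.1), so each filter is evaluated on the intermediate result; a minimal sketch with the same df:

    df.loc[lambda d: d['a'] > 2].loc[lambda d: d['b'] > 4]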

    DISCLAIMER: I first gave this answer elsewhere because I hadn't seen this.

  • 2020-11-22 17:25

    The answer from @lodagro is great. I would extend it by generalizing the mask function as:

    def mask(df, f):
        return df[f(df)]
    # attach it so it can be chained (note: this shadows pandas' built-in DataFrame.mask)
    pd.DataFrame.mask = mask
    

    Then you can do stuff like:

    df.mask(lambda x: x[0] < 0).mask(lambda x: x[1] > 0)
    
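    If you would rather not shadow pandas' built-in DataFrame.mask, the same helper chains cleanly through pipe; a minimal runnable sketch:

    import pandas as pd

    def mask(df, f):
        return df[f(df)]

    df = pd.DataFrame({0: [-1, -2, 3], 1: [5, -4, 6]})

    # chain via pipe instead of monkey-patching
    print(df.pipe(mask, lambda x: x[0] < 0).pipe(mask, lambda x: x[1] > 0))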