Logical operators for boolean indexing in Pandas

前端 未结 3 916
清歌不尽
清歌不尽 2020-11-21 23:13

I\'m working with boolean index in Pandas. The question is why the statement:

a[(a[\'some_column\']==some_number) & (a[\'some_other_column\']==some_other         


        
3条回答
  •  再見小時候
    2020-11-21 23:32

    TLDR; Logical Operators in Pandas are &, | and ~, and parentheses (...) is important!

    Python's and, or and not logical operators are designed to work with scalars. So Pandas had to do one better and override the bitwise operators to achieve vectorized (element-wise) version of this functionality.

    So the following in python (exp1 and exp2 are expressions which evaluate to a boolean result)...

    exp1 and exp2              # Logical AND
    exp1 or exp2               # Logical OR
    not exp1                   # Logical NOT
    

    ...will translate to...

    exp1 & exp2                # Element-wise logical AND
    exp1 | exp2                # Element-wise logical OR
    ~exp1                      # Element-wise logical NOT
    

    for pandas.

    If in the process of performing logical operation you get a ValueError, then you need to use parentheses for grouping:

    (exp1) op (exp2)
    

    For example,

    (df['col1'] == x) & (df['col2'] == y) 
    

    And so on.


    Boolean Indexing: A common operation is to compute boolean masks through logical conditions to filter the data. Pandas provides three operators: & for logical AND, | for logical OR, and ~ for logical NOT.

    Consider the following setup:

    np.random.seed(0)
    df = pd.DataFrame(np.random.choice(10, (5, 3)), columns=list('ABC'))
    df
    
       A  B  C
    0  5  0  3
    1  3  7  9
    2  3  5  2
    3  4  7  6
    4  8  8  1
    

    Logical AND

    For df above, say you'd like to return all rows where A < 5 and B > 5. This is done by computing masks for each condition separately, and ANDing them.

    Overloaded Bitwise & Operator
    Before continuing, please take note of this particular excerpt of the docs, which state

    Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses, since by default Python will evaluate an expression such as df.A > 2 & df.B < 3 as df.A > (2 & df.B) < 3, while the desired evaluation order is (df.A > 2) & (df.B < 3).

    So, with this in mind, element wise logical AND can be implemented with the bitwise operator &:

    df['A'] < 5
    
    0    False
    1     True
    2     True
    3     True
    4    False
    Name: A, dtype: bool
    
    df['B'] > 5
    
    0    False
    1     True
    2    False
    3     True
    4     True
    Name: B, dtype: bool
    

    (df['A'] < 5) & (df['B'] > 5)
    
    0    False
    1     True
    2    False
    3     True
    4    False
    dtype: bool
    

    And the subsequent filtering step is simply,

    df[(df['A'] < 5) & (df['B'] > 5)]
    
       A  B  C
    1  3  7  9
    3  4  7  6
    

    The parentheses are used to override the default precedence order of bitwise operators, which have higher precedence over the conditional operators < and >. See the section of Operator Precedence in the python docs.

    If you do not use parentheses, the expression is evaluated incorrectly. For example, if you accidentally attempt something such as

    df['A'] < 5 & df['B'] > 5
    

    It is parsed as

    df['A'] < (5 & df['B']) > 5
    

    Which becomes,

    df['A'] < something_you_dont_want > 5
    

    Which becomes (see the python docs on chained operator comparison),

    (df['A'] < something_you_dont_want) and (something_you_dont_want > 5)
    

    Which becomes,

    # Both operands are Series...
    something_else_you_dont_want1 and something_else_you_dont_want2

    Which throws

    ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
    

    So, don't make that mistake!1

    Avoiding Parentheses Grouping
    The fix is actually quite simple. Most operators have a corresponding bound method for DataFrames. If the individual masks are built up using functions instead of conditional operators, you will no longer need to group by parens to specify evaluation order:

    df['A'].lt(5)
    
    0     True
    1     True
    2     True
    3     True
    4    False
    Name: A, dtype: bool
    
    df['B'].gt(5)
    
    0    False
    1     True
    2    False
    3     True
    4     True
    Name: B, dtype: bool
    

    df['A'].lt(5) & df['B'].gt(5)
    
    0    False
    1     True
    2    False
    3     True
    4    False
    dtype: bool
    

    See the section on Flexible Comparisons.. To summarise, we have

    ╒════╤════════════╤════════════╕
    │    │ Operator   │ Function   │
    ╞════╪════════════╪════════════╡
    │  0 │ >          │ gt         │
    ├────┼────────────┼────────────┤
    │  1 │ >=         │ ge         │
    ├────┼────────────┼────────────┤
    │  2 │ <          │ lt         │
    ├────┼────────────┼────────────┤
    │  3 │ <=         │ le         │
    ├────┼────────────┼────────────┤
    │  4 │ ==         │ eq         │
    ├────┼────────────┼────────────┤
    │  5 │ !=         │ ne         │
    ╘════╧════════════╧════════════╛
    

    Another option for avoiding parentheses is to use DataFrame.query (or eval):

    df.query('A < 5 and B > 5')
    
       A  B  C
    1  3  7  9
    3  4  7  6
    

    I have extensively documented query and eval in Dynamic Expression Evaluation in pandas using pd.eval().

    operator.and_
    Allows you to perform this operation in a functional manner. Internally calls Series.__and__ which corresponds to the bitwise operator.

    import operator 
    
    operator.and_(df['A'] < 5, df['B'] > 5)
    # Same as,
    # (df['A'] < 5).__and__(df['B'] > 5) 
    
    0    False
    1     True
    2    False
    3     True
    4    False
    dtype: bool
    
    df[operator.and_(df['A'] < 5, df['B'] > 5)]
    
       A  B  C
    1  3  7  9
    3  4  7  6
    

    You won't usually need this, but it is useful to know.

    Generalizing: np.logical_and (and logical_and.reduce)
    Another alternative is using np.logical_and, which also does not need parentheses grouping:

    np.logical_and(df['A'] < 5, df['B'] > 5)
    
    0    False
    1     True
    2    False
    3     True
    4    False
    Name: A, dtype: bool
    
    df[np.logical_and(df['A'] < 5, df['B'] > 5)]
    
       A  B  C
    1  3  7  9
    3  4  7  6
    

    np.logical_and is a ufunc (Universal Functions), and most ufuncs have a reduce method. This means it is easier to generalise with logical_and if you have multiple masks to AND. For example, to AND masks m1 and m2 and m3 with &, you would have to do

    m1 & m2 & m3
    

    However, an easier option is

    np.logical_and.reduce([m1, m2, m3])
    

    This is powerful, because it lets you build on top of this with more complex logic (for example, dynamically generating masks in a list comprehension and adding all of them):

    import operator
    
    cols = ['A', 'B']
    ops = [np.less, np.greater]
    values = [5, 5]
    
    m = np.logical_and.reduce([op(df[c], v) for op, c, v in zip(ops, cols, values)])
    m 
    # array([False,  True, False,  True, False])
    
    df[m]
       A  B  C
    1  3  7  9
    3  4  7  6
    

    1 - I know I'm harping on this point, but please bear with me. This is a very, very common beginner's mistake, and must be explained very thoroughly.


    Logical OR

    For the df above, say you'd like to return all rows where A == 3 or B == 7.

    Overloaded Bitwise |

    df['A'] == 3
    
    0    False
    1     True
    2     True
    3    False
    4    False
    Name: A, dtype: bool
    
    df['B'] == 7
    
    0    False
    1     True
    2    False
    3     True
    4    False
    Name: B, dtype: bool
    

    (df['A'] == 3) | (df['B'] == 7)
    
    0    False
    1     True
    2     True
    3     True
    4    False
    dtype: bool
    
    df[(df['A'] == 3) | (df['B'] == 7)]
    
       A  B  C
    1  3  7  9
    2  3  5  2
    3  4  7  6
    

    If you haven't yet, please also read the section on Logical AND above, all caveats apply here.

    Alternatively, this operation can be specified with

    df[df['A'].eq(3) | df['B'].eq(7)]
    
       A  B  C
    1  3  7  9
    2  3  5  2
    3  4  7  6
    

    operator.or_
    Calls Series.__or__ under the hood.

    operator.or_(df['A'] == 3, df['B'] == 7)
    # Same as,
    # (df['A'] == 3).__or__(df['B'] == 7)
    
    0    False
    1     True
    2     True
    3     True
    4    False
    dtype: bool
    
    df[operator.or_(df['A'] == 3, df['B'] == 7)]
    
       A  B  C
    1  3  7  9
    2  3  5  2
    3  4  7  6
    

    np.logical_or
    For two conditions, use logical_or:

    np.logical_or(df['A'] == 3, df['B'] == 7)
    
    0    False
    1     True
    2     True
    3     True
    4    False
    Name: A, dtype: bool
    
    df[np.logical_or(df['A'] == 3, df['B'] == 7)]
    
       A  B  C
    1  3  7  9
    2  3  5  2
    3  4  7  6
    

    For multiple masks, use logical_or.reduce:

    np.logical_or.reduce([df['A'] == 3, df['B'] == 7])
    # array([False,  True,  True,  True, False])
    
    df[np.logical_or.reduce([df['A'] == 3, df['B'] == 7])]
    
       A  B  C
    1  3  7  9
    2  3  5  2
    3  4  7  6
    

    Logical NOT

    Given a mask, such as

    mask = pd.Series([True, True, False])
    

    If you need to invert every boolean value (so that the end result is [False, False, True]), then you can use any of the methods below.

    Bitwise ~

    ~mask
    
    0    False
    1    False
    2     True
    dtype: bool
    

    Again, expressions need to be parenthesised.

    ~(df['A'] == 3)
    
    0     True
    1    False
    2    False
    3     True
    4     True
    Name: A, dtype: bool
    

    This internally calls

    mask.__invert__()
    
    0    False
    1    False
    2     True
    dtype: bool
    

    But don't use it directly.

    operator.inv
    Internally calls __invert__ on the Series.

    operator.inv(mask)
    
    0    False
    1    False
    2     True
    dtype: bool
    

    np.logical_not
    This is the numpy variant.

    np.logical_not(mask)
    
    0    False
    1    False
    2     True
    dtype: bool
    

    Note, np.logical_and can be substituted for np.bitwise_and, logical_or with bitwise_or, and logical_not with invert.

提交回复
热议问题