Question:
I'm having an issue filtering my result dataframe with an `or` condition. I want my result df to keep all rows whose `var` column values are above 0.25 or below -0.25. The logic below gives me an ambiguous truth value error; however, it works when I split the filtering into two separate operations. What is happening here? I'm not sure where to use the suggested `a.empty()`, `a.bool()`, `a.item()`, `a.any()` or `a.all()`.
    result = result[(result['var']>0.25) or (result['var']<-0.25)]
Answer 1:
The `or` and `and` Python statements require truth-values. For pandas these are considered ambiguous, so you should use the "bitwise" `|` (or) or `&` (and) operations:
    result = result[(result['var']>0.25) | (result['var']<-0.25)]
These are overloaded for these kinds of data structures to yield the element-wise `or` (or `and`).
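As a minimal sketch of the element-wise filter, assuming a small made-up frame in place of the asker's `result` (the values are hypothetical, chosen only to exercise both sides of the condition):

```python
import pandas as pd

# Hypothetical data standing in for the asker's `result` frame.
result = pd.DataFrame({"var": [0.5, 0.1, -0.3, -0.1, 0.26]})

# Each comparison yields a boolean Series; `|` combines them element-wise.
mask = (result["var"] > 0.25) | (result["var"] < -0.25)

# Indexing with the boolean mask keeps only the rows where it is True.
filtered = result[mask]

print(mask.tolist())
print(filtered["var"].tolist())
```

The mask has one boolean per row, which is why it can be used directly as a row selector.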
Just to add some more explanation to this statement:
The exception is thrown when you want to get the `bool` of a `pandas.Series`:
    >>> import pandas as pd
    >>> x = pd.Series([1])
    >>> bool(x)
    ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
What you hit was a place where the operator implicitly converted the operands to `bool` (you used `or`, but it also happens for `and`, `if` and `while`):
    >>> x or x
    ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
    >>> x and x
    ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
    >>> if x:
    ...     print('fun')
    ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
    >>> while x:
    ...     print('fun')
    ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
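The same behaviour can be checked in a script rather than at the prompt. This is only a sketch: `truthiness_error` is a hypothetical helper name, and the Series values are made up.

```python
import pandas as pd

x = pd.Series([1, 0, 2])


def truthiness_error(series):
    """Return the message raised by bool(series), or None if nothing is raised."""
    try:
        bool(series)
    except ValueError as exc:
        return str(exc)
    return None


# bool() on a multi-element Series raises instead of guessing a single answer.
message = truthiness_error(x)
print(message)

# The element-wise operator never calls bool() on the whole Series.
elementwise = (x > 0) | (x < 0)
print(elementwise.tolist())
```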
Besides these four statements, there are several Python functions that hide some `bool` calls (like `any`, `all`, `filter`, ...). These are normally not problematic with `pandas.Series`, but for completeness I wanted to mention them.
In your case the exception isn't really helpful, because it doesn't mention the right alternatives. For `and` and `or` you can use (if you want element-wise comparisons):
`numpy.logical_or`:
    >>> import numpy as np
    >>> np.logical_or(x, y)
or simply the `|` operator:
    >>> x | y
`numpy.logical_and`:
    >>> np.logical_and(x, y)
or simply the `&` operator:
    >>> x & y
If you're using the operators, make sure you set your parentheses correctly, because `|` and `&` bind more tightly than comparison operators such as `>` and `<`.
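To illustrate the precedence point, here is a small sketch with made-up values. The unparenthesized expression is expected to raise, and since the exact exception type may depend on the pandas version, both `TypeError` and `ValueError` are caught.

```python
import pandas as pd

s = pd.Series([0.5, 0.1, -0.3])

# Without parentheses, `|` binds tighter than `>` and `<`, so Python reads
# this as `s > (0.25 | s) < -0.25`, which is not the intended filter.
try:
    _ = s > 0.25 | s < -0.25
    failed = False
except (TypeError, ValueError):
    failed = True

# With parentheses, each comparison is evaluated first, then combined.
mask = (s > 0.25) | (s < -0.25)

print(failed)
print(mask.tolist())
```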
There are several logical numpy functions which should work on `pandas.Series`.
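A brief sketch of some of those functions applied to two made-up boolean Series (the names `a` and `b` are illustrative only):

```python
import numpy as np
import pandas as pd

a = pd.Series([True, True, False, False])
b = pd.Series([True, False, True, False])

# Each function works element-wise, like `|`, `&`, `^` and `~` do.
either = np.logical_or(a, b)
both = np.logical_and(a, b)
exactly_one = np.logical_xor(a, b)
not_a = np.logical_not(a)

print(either.tolist())
print(both.tolist())
print(exactly_one.tolist())
print(not_a.tolist())
```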
The alternatives mentioned in the exception are more suited if you encountered it when doing `if` or `while`. I'll briefly explain each of these:
If you want to check if your Series is empty:
    >>> x = pd.Series([])
    >>> x.empty
    True
    >>> x = pd.Series([1])
    >>> x.empty
    False
Python normally interprets the length (`len`) of containers (like `list`, `tuple`, ...) as their truth-value if they have no explicit boolean interpretation. So if you want the Python-like check, you could do `if x.size` or `if not x.empty` instead of `if x`.
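For instance, an explicit emptiness check inside a function might look like the following sketch, where `describe` is a hypothetical helper and the inputs are made up:

```python
import pandas as pd


def describe(series):
    """Branch on emptiness explicitly instead of writing `if series:`."""
    if series.empty:
        return "empty"
    return f"{series.size} value(s)"


# An explicit dtype is given for the empty Series to keep it unambiguous.
print(describe(pd.Series([], dtype="float64")))
print(describe(pd.Series([1, 2, 3])))
```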
If your `Series` contains one and only one boolean value:
    >>> x = pd.Series([100])
    >>> (x > 50).bool()
    True
    >>> (x < 50).bool()
    False
If you want to check the first and only item of your Series (like `.bool()`, but it works even for non-boolean contents):
    >>> x = pd.Series([100])
    >>> x.item()
    100
If you want to check if all or any item is not-zero, not-empty or not-False:
    >>> x = pd.Series([0, 1, 2])
    >>> x.all()   # because one element is zero
    False
    >>> x.any()   # because one (or more) elements are non-zero
    True
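Put together, reducing a boolean Series to a single value before an `if` could look like this sketch (the values and variable names are made up for illustration):

```python
import pandas as pd

values = pd.Series([0.5, 0.1, -0.3])

# Element-wise mask: True where a value lies outside [-0.25, 0.25].
outside = (values > 0.25) | (values < -0.25)

# Reduce the mask to one plain bool before using it in `if`.
has_outlier = bool(outside.any())
all_outliers = bool(outside.all())

if has_outlier:
    print("at least one value is outside the band")
if not all_outliers:
    print("but not every value is")

# .item() extracts the single element of a one-element Series.
single = pd.Series([100]).item()
print(single)
```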
Answer 2:
For boolean logic, use `&` and `|`.
    >>> import numpy as np
    >>> import pandas as pd
    >>> np.random.seed(0)
    >>> df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
    >>> df
              A         B         C
    0  1.764052  0.400157  0.978738
    1  2.240893  1.867558 -0.977278
    2  0.950088 -0.151357 -0.103219
    3  0.410599  0.144044  1.454274
    4  0.761038  0.121675  0.443863
    >>> df.loc[(df.C > 0.25) | (df.C < -0.25)]
To see what is happening, you get a column of booleans for each comparison, e.g.
    >>> df.C > 0.25
    0     True
    1    False
    2    False
    3     True
    4     True
    Name: C, dtype: bool
When you have multiple criteria, you will get multiple columns returned. This is why the join logic is ambiguous. Using `and` or `or` treats each column separately, so you first need to reduce that column to a single boolean value. For example, to see if any value or all values in each of the columns is True.
    # Any value in either column is True?
    (df.C > 0.25).any() or (df.C < -0.25).any()
    # All values in either column are True?
    (df.C > 0.25).all() or (df.C < -0.25).all()
One convoluted way to achieve the same thing is to zip all of these columns together, and perform the appropriate logic.
    >>> df[[any([a, b]) for a, b in zip(df.C > 0.25, df.C < -0.25)]]
For more details, refer to Boolean Indexing in the docs.
Answer 3:
Alternatively, you could use the `operator` module. More detailed information is in the Python docs.
    import operator
    import numpy as np
    import pandas as pd
    np.random.seed(0)
    df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
    df.loc[operator.or_(df.C > 0.25, df.C < -0.25)]
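One case where the function form may be handy is combining a list of conditions. The sketch below assumes `functools.reduce` for the folding step, which is an addition not mentioned in this answer, and uses made-up column values:

```python
import operator
from functools import reduce

import pandas as pd

# Hypothetical values for a single column C.
df = pd.DataFrame({"C": [0.98, -0.98, -0.10, 1.45, 0.44]})

# Any number of boolean Series can be collected in a list...
conditions = [df.C > 0.25, df.C < -0.25]

# ...and folded together with the function form of `|`.
mask = reduce(operator.or_, conditions)
selected = df.loc[mask]

print(selected.index.tolist())
```

Swapping `operator.or_` for `operator.and_` would require all conditions to hold instead of any.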
Answer 4:
This excellent answer explains very well what is happening and provides a solution. I would like to add another solution that might be suitable in similar cases: using the `query` method:
    result = result.query("(var > 0.25) or (var < -0.25)")
See also http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-query.
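As a quick check that the `query` string and the boolean mask select the same rows, here is a sketch on a made-up frame (the values are hypothetical):

```python
import pandas as pd

# Hypothetical data standing in for the asker's `result` frame.
result = pd.DataFrame({"var": [0.5, 0.1, -0.3, -0.1, 0.26]})

# Inside a query string, the word `or` is accepted for the element-wise test.
by_query = result.query("(var > 0.25) or (var < -0.25)")

# The equivalent boolean-mask form for comparison.
by_mask = result[(result["var"] > 0.25) | (result["var"] < -0.25)]

print(by_query.index.tolist())
print(by_query.equals(by_mask))
```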
A word of warning: at least one situation where this is not straightforward is when column names happen to be Python expressions. I had columns named `WT_38hph_IP_2`, `WT_38hph_input_2` and `log2(WT_38hph_IP_2/WT_38hph_input_2)` and wanted to perform the following query: `"(log2(WT_38hph_IP_2/WT_38hph_input_2) > 1) and (WT_38hph_IP_2 > 20)"`
I obtained the following exception cascade:
KeyError: 'log2'
UndefinedVariableError: name 'log2' is not defined
ValueError: "log2" is not a supported function
I guess this happened because the query parser was trying to make something from the first two columns instead of identifying the expression with the name of the third column.
A possible workaround is proposed here.
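The linked workaround is not reproduced here. As a separate sketch, and not necessarily the approach behind that link, plain bracket indexing sidesteps the query parser, because the column name is then only ever used as a string key. The numbers below are made up for illustration.

```python
import pandas as pd

# The awkward column name is kept in a variable and used only as a key.
ratio_col = "log2(WT_38hph_IP_2/WT_38hph_input_2)"

# Hypothetical values; the third column is filled in by hand here.
df = pd.DataFrame({
    "WT_38hph_IP_2": [30, 10, 50],
    "WT_38hph_input_2": [10, 10, 100],
    ratio_col: [1.58, 0.0, -1.0],
})

# Bracket access never parses the name as an expression.
selected = df[(df[ratio_col] > 1) & (df["WT_38hph_IP_2"] > 20)]

print(selected.index.tolist())
```

The pandas documentation for `DataFrame.query` also describes backtick quoting for column names that are not valid Python identifiers; whether it copes with a name like this one may depend on the pandas version, so it is not shown here.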