Question
This is essentially a rehashing of the content of my answer here.
I came across some weird behaviour when trying to solve this question, using pd.notnull.
Consider
x = ('A4', np.nan)
I want to check which of these items are null. Using np.isnan directly will throw a TypeError (but I've figured out how to solve that).
Using pd.notnull does not work.
>>> pd.notnull(x)
True
It treats the tuple as a single value (rather than an iterable of values). Furthermore, converting this to a list and then testing also gives an incorrect answer.
>>> pd.notnull(list(x))
array([ True, True])
Since the second value is nan, the result I'm looking for should be [True, False]. It finally works when you pre-convert to a Series:
>>> pd.Series(x).notnull()
0 True
1 False
dtype: bool
So, the solution is to Series-ify it and then test the values.
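The Series-based workaround can be wrapped in a small helper. This is only a sketch: the name notnull_each is hypothetical, and it assumes the input is a flat iterable of scalars. Passing dtype=object explicitly keeps NaN as a float rather than letting it be coerced to a string.

```python
import numpy as np
import pandas as pd

def notnull_each(values):
    """Hypothetical helper: element-wise not-null check for a flat iterable."""
    # dtype=object stores each element as-is, so NaN stays a float NaN
    return pd.Series(list(values), dtype=object).notnull().tolist()

x = ('A4', np.nan)
result = notnull_each(x)
print(result)  # [True, False]
```

The helper returns a plain Python list, which sidesteps the question of whether a tuple is treated as a single value.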
Along similar lines, another (admittedly roundabout) solution is to pre-convert to an object dtype numpy array, after which pd.notnull works directly (np.isnan still raises a TypeError on an object array):
>>> pd.notnull(np.array(x, dtype=object))
array([ True, False])
I imagine that pd.notnull directly converts x to a string array under the covers, rendering the NaN as a string "nan", so it is no longer a "null" value.
Is that what pd.notnull is doing here? Or is there something else going on under the covers that I should be aware of?
Notes
In [156]: pd.__version__
Out[156]: '0.22.0'
Answer 1:
Here is the issue related to this behavior: https://github.com/pandas-dev/pandas/issues/20675.
In short, if the argument passed to notnull is a list, it is internally converted to a numpy array with np.asarray. The bug occurred because, when no dtype is specified, numpy converts np.nan to the string 'nan' (which pd.isnull does not recognize as a null value):
import numpy as np

a = ['A4', np.nan]
np.asarray(a)
# array(['A4', 'nan'], dtype='<U3')  (the string width may differ across NumPy versions)
This problem was fixed in version 0.23.0 by calling np.asarray with dtype=object.
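The difference between the two conversions can be reproduced outside of pandas internals. The following is a minimal sketch, not the actual pandas source; the variable names are illustrative. It passes pre-built arrays to pd.notnull, so the behaviour shown comes only from the dtype numpy chose.

```python
import numpy as np
import pandas as pd

values = ['A4', np.nan]

# Without a dtype, numpy infers a unicode string dtype and NaN becomes 'nan'
as_str = np.asarray(values)
# With dtype=object, each element is stored as-is, so NaN stays a float
as_obj = np.asarray(values, dtype=object)

str_mask = pd.notnull(as_str)   # every element is a non-null string
obj_mask = pd.notnull(as_obj)   # the NaN is detected

print(as_str.dtype.kind, str_mask.tolist())  # U [True, True]
print(as_obj.dtype.kind, obj_mask.tolist())  # O [True, False]
```

On pandas versions older than 0.23.0, converting to an object array before calling pd.notnull gives the same result as the fix.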
Source: https://stackoverflow.com/questions/51035790/weird-null-checking-behaviour-by-pd-notnull