I'm trying to replace some NaN values in my data with an empty list []. However, the list is represented as a str and doesn't allow me to properly apply the len() function.
You can also use a list comprehension for this:
d['x'] = [[] if x is np.nan else x for x in d['x']]
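An identity test against NumPy's NaN constant only matches that one object, so a NaN created some other way (for example `float('nan')`) would slip through. Here is a minimal sketch of a type-based variant, assuming every valid entry in the column is a Python list. The sample frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: lists mixed with two different NaN objects.
d = pd.DataFrame({
    'x': [[1, 2, 3], [1, 2], np.nan, float('nan')],
    'y': [1, 2, 3, 4],
})

# Keep real lists; replace anything else (here, the NaN floats) with a fresh [].
d['x'] = [x if isinstance(x, list) else [] for x in d['x']]

lengths = d['x'].apply(len).tolist()
print(lengths)
```

Because the comprehension evaluates `[]` once per row, each replaced cell gets its own list object.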
This works using isnull and loc to mask the series:
In [90]:
d.loc[d.isnull()] = d.loc[d.isnull()].apply(lambda x: [])
d
Out[90]:
0 [1, 2, 3]
1 [1, 2]
2 []
3 []
dtype: object
In [91]:
d.apply(len)
Out[91]:
0 3
1 2
2 0
3 0
dtype: int64
You have to do this using apply so that the list object is not interpreted as an array to assign back to the df, which would try to align its shape with the original series.
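As a self-contained sketch of the masking approach above (the sample series is an assumption, rebuilt from the output shown), the intermediate result of apply can be held in a variable to make the index alignment visible:

```python
import numpy as np
import pandas as pd

# Assumed sample data matching the output shown above.
d = pd.Series([[1, 2, 3], [1, 2], np.nan, np.nan])

mask = d.isnull()

# apply calls the lambda once per masked row, so each row gets a new list,
# and the result keeps the index labels of the masked rows (2 and 3 here).
replacement = d.loc[mask].apply(lambda _: [])

# Assigning a Series aligns on those index labels.
d.loc[mask] = replacement

lengths = d.apply(len).tolist()
print(lengths)
```

The right-hand side is a Series of list objects rather than a bare list, which is what lets pandas place one empty list per masked row.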
EDIT
Using your updated sample the following works:
In [100]:
d.loc[d['x'].isnull(),['x']] = d.loc[d['x'].isnull(),'x'].apply(lambda x: [])
d
Out[100]:
x y
0 [1, 2, 3] 1
1 [1, 2] 2
2 [] 3
3 [] 4
In [102]:
d['x'].apply(len)
Out[102]:
0 3
1 2
2 0
3 0
Name: x, dtype: int64
To extend the accepted answer: apply calls can be particularly expensive. The same task can be accomplished without apply by constructing a NumPy array from scratch.
isna = df['x'].isna()
df.loc[isna, 'x'] = pd.Series([[]] * isna.sum()).values
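One caveat worth knowing: in plain Python, `[[]] * n` repeats a single list object n times, so the filled cells may all refer to the same list. This only matters if the lists are mutated later. A small sketch of the difference, using illustrative variable names:

```python
n = 3

# List multiplication copies the reference, not the list.
shared = [[]] * n
shared[0].append('a')

# A comprehension evaluates [] once per iteration, giving n distinct lists.
independent = [[] for _ in range(n)]
independent[0].append('a')

print(shared)       # every slot shows the appended item
print(independent)  # only the first slot changes
```

If the empty lists will be appended to in place, building them with a comprehension avoids the aliasing, at some cost in construction speed.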
A quick timing comparison:
def empty_assign_1(s):
    s.isna().apply(lambda x: [])
def empty_assign_2(s):
    pd.Series([[]] * s.isna().sum()).values
series = pd.Series(np.random.choice([1, 2, np.nan], 1000000))
%timeit empty_assign_1(series)
>>> 172 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit empty_assign_2(series)
>>> 19.5 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
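Nearly 9 times faster in this run! Note that empty_assign_1 calls the lambda on every element of the mask rather than only on the NaN rows, so the two functions do not do exactly the same amount of work.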