Replace NaN with empty list in a pandas dataframe

后端未结

关注

 3  521

I\'m trying to replace some NaN values in my data with an empty list []. However the list is represented as a str and doesn\'t allow me to properly apply the len() function.

相关标签:

3条回答

慢半拍i

2021-01-04 02:34
You can also use a list comprehension for this:
```
d['x'] = [ [] if x is np.NaN else x for x in d['x'] ]
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

余生分开走

2021-01-04 02:47

This works using isnull and loc to mask the series:

In [90]:
d.loc[d.isnull()] = d.loc[d.isnull()].apply(lambda x: [])
d

Out[90]:
0    [1, 2, 3]
1       [1, 2]
2           []
3           []
dtype: object

In [91]:
d.apply(len)

Out[91]:
0    3
1    2
2    0
3    0
dtype: int64

You have to do this using apply in order for the list object to not be interpreted as an array to assign back to the df which will try to align the shape back to the original series

EDIT

Using your updated sample the following works:

In [100]:
d.loc[d['x'].isnull(),['x']] = d.loc[d['x'].isnull(),'x'].apply(lambda x: [])
d

Out[100]:
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2         []  3
3         []  4

In [102]:    
d['x'].apply(len)

Out[102]:
0    3
1    2
2    0
3    0
Name: x, dtype: int64

0 讨论(0)

故里飘歌

2021-01-04 02:48

To extend the accepted answer, apply calls can be particularly expensive - the same task can be accomplished without it by constructing a numpy array from scratch.

isna = df['x'].isna()
df.loc[isna, 'x'] = pd.Series([[]] * isna.sum()).values

A quick timing comparison:

def empty_assign_1(s):
    s.isna().apply(lambda x: [])

def empty_assign_2(s):
    pd.Series([[]] * s.isna().sum()).values

series = pd.Series(np.random.choice([1, 2, np.nan], 1000000))

%timeit empty_assign_1(series)
>>> 172 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit empty_assign_2(series)
>>> 19.5 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Nearly 10 times faster!

0 讨论(0)