Replace NaN with empty list in a pandas dataframe

后悔当初 2021-01-04 02:11

I'm trying to replace some NaN values in my data with an empty list []. However, the list is represented as a str, which doesn't allow me to properly apply the len() function.

3 Answers
  • 2021-01-04 02:34

    You can also use a list comprehension for this:

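    # np.NaN was removed in NumPy 2.0; np.isnan also catches NaNs that are not the np.nan object
    d['x'] = [[] if isinstance(x, float) and np.isnan(x) else x for x in d['x']]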
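
For anyone who wants to run this end to end, here is a minimal sketch. The sample frame and the helper name is_missing are assumptions made for illustration, not part of the original question; the result is wrapped in a Series to keep the index explicit.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame: column 'x' mixes lists and missing values.
d = pd.DataFrame({"x": [[1, 2, 3], [1, 2], np.nan, np.nan], "y": [1, 2, 3, 4]})


def is_missing(value):
    # Only float scalars can be NaN here; lists are passed through untouched.
    return isinstance(value, float) and np.isnan(value)


# Build a fresh list for every missing cell, keep everything else as is.
d["x"] = pd.Series([[] if is_missing(v) else v for v in d["x"]], index=d.index)

lengths = d["x"].apply(len).tolist()
print(lengths)  # [3, 2, 0, 0]
```

Because the comprehension evaluates [] once per row, every filled cell gets its own list object.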
    
  • 2021-01-04 02:47

    This works using isnull and loc to mask the series:

    In [90]:
    d.loc[d.isnull()] = d.loc[d.isnull()].apply(lambda x: [])
    d
    
    Out[90]:
    0    [1, 2, 3]
    1       [1, 2]
    2           []
    3           []
    dtype: object
    
    In [91]:
    d.apply(len)
    
    Out[91]:
    0    3
    1    2
    2    0
    3    0
    dtype: int64
    

    You have to do this using apply so that the empty list is stored as a single cell value. If you assign a bare [] directly, pandas interprets it as an array of values to assign back to the df and tries to align its shape with the original series.
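
A small sketch of that difference, assuming the same sample series as in the output above. The direct assignment is wrapped in try/except because what exactly happens (and which exception is raised, if any) may depend on the pandas version; only the apply path is relied upon here.

```python
import numpy as np
import pandas as pd

d = pd.Series([[1, 2, 3], [1, 2], np.nan, np.nan], dtype=object)
mask = d.isnull()

# Direct assignment of a bare list: [] is taken as a sequence of values
# for the selected rows, not as one value per cell.
attempt = d.copy()
try:
    attempt.loc[mask] = []
    print("direct assignment did not raise")
except Exception as exc:  # exception type is not assumed here
    print(f"direct assignment raised {type(exc).__name__}")

# Going through apply yields a Series in which each cell holds its own list,
# and that Series is aligned back to the masked rows by index.
filled = d.copy()
filled.loc[mask] = filled.loc[mask].apply(lambda _: [])
filled_lengths = filled.apply(len).tolist()
print(filled_lengths)  # [3, 2, 0, 0]
```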

    EDIT

    Using your updated sample the following works:

    In [100]:
    d.loc[d['x'].isnull(),['x']] = d.loc[d['x'].isnull(),'x'].apply(lambda x: [])
    d
    
    Out[100]:
               x  y
    0  [1, 2, 3]  1
    1     [1, 2]  2
    2         []  3
    3         []  4
    
    In [102]:    
    d['x'].apply(len)
    
    Out[102]:
    0    3
    1    2
    2    0
    3    0
    Name: x, dtype: int64
    
  • 2021-01-04 02:48

    To extend the accepted answer, apply calls can be particularly expensive - the same task can be accomplished without it by constructing a numpy array from scratch.

    isna = df['x'].isna()
    df.loc[isna, 'x'] = pd.Series([[]] * isna.sum()).values
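
One caveat worth checking before relying on this: [[]] * n repeats a reference to one list object, so every filled cell shares the same list. The sketch below illustrates the effect and one possible alternative; the helper name distinct_empty_lists is hypothetical.

```python
import pandas as pd

n = 3
shared = pd.Series([[]] * n)                  # n references to one list
separate = pd.Series([[] for _ in range(n)])  # n distinct lists

# Mutating the first cell of each series in place.
shared.iloc[0].append("a")
separate.iloc[0].append("a")

print(shared.tolist())    # [['a'], ['a'], ['a']]
print(separate.tolist())  # [['a'], [], []]


def distinct_empty_lists(count):
    # Hypothetical helper: an object array holding one fresh list per row.
    return pd.Series([[] for _ in range(count)]).values
```

If the cells are only ever read (for example with len), the shared list is harmless; if any of them is later appended to, distinct lists are the safer choice, at the cost of a Python-level loop.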
    

    A quick timing comparison:

    import numpy as np
    import pandas as pd

    def empty_assign_1(s):
        s.isna().apply(lambda x: [])
    
    def empty_assign_2(s):
        pd.Series([[]] * s.isna().sum()).values
    
    series = pd.Series(np.random.choice([1, 2, np.nan], 1000000))
    
    %timeit empty_assign_1(series)
    >>> 172 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    %timeit empty_assign_2(series)
    >>> 19.5 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    Roughly 9 times faster in this run (172 ms vs. 19.5 ms).
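
Note that the two functions above do different amounts of work: the first applies the lambda to every element of the boolean mask, while the second builds one entry per NaN only. Below is a sketch of a closer comparison using the standard timeit module outside IPython. The function names, the smaller series size, and the seeded generator are assumptions for illustration, and no timings are claimed here since they depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
series = pd.Series(rng.choice([1, 2, np.nan], 100_000))
nan_count = int(series.isna().sum())


def with_apply(s):
    # Apply the lambda to the NaN rows only, as in the loc/apply answer.
    return s[s.isna()].apply(lambda _: [])


def with_array(s):
    # Build one distinct list per NaN row without going through apply.
    return pd.Series([[] for _ in range(int(s.isna().sum()))]).values


for fn in (with_apply, with_array):
    seconds = timeit.timeit(lambda: fn(series), number=5) / 5
    print(f"{fn.__name__}: {seconds * 1e3:.1f} ms per call")
```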
