问题
I would like to fill missing values in one column with values from another column, using fillna
method.
(I read that looping through each row would be very bad practice and that it would be better to do everything in one go but I could not find out how to do it with fillna
.)
Data before:
Day Cat1 Cat2
1 cat mouse
2 dog elephant
3 cat giraf
4 NaN ant
Data after:
Day Cat1 Cat2
1 cat mouse
2 dog elephant
3 cat giraf
4 ant ant
回答1:
You can provide this column to fillna
(see docs), it will use those values on matching indexes to fill:
In [17]: df['Cat1'].fillna(df['Cat2'])
Out[17]:
0 cat
1 dog
2 cat
3 ant
Name: Cat1, dtype: object
回答2:
You could do
df.Cat1 = np.where(df.Cat1.isnull(), df.Cat2, df.Cat1)
The overall construct on the RHS uses the ternary pattern from the pandas cookbook (which it pays to read in any case). It's a vector version of a? b: c
.
回答3:
Just use the value
parameter instead of method
:
In [20]: df
Out[20]:
Cat1 Cat2 Day
0 cat mouse 1
1 dog elephant 2
2 cat giraf 3
3 NaN ant 4
In [21]: df.Cat1 = df.Cat1.fillna(value=df.Cat2)
In [22]: df
Out[22]:
Cat1 Cat2 Day
0 cat mouse 1
1 dog elephant 2
2 cat giraf 3
3 ant ant 4
回答4:
pandas.DataFrame.combine_first also works.
(Attention: since "Result index columns will be the union of the respective indexes and columns", you should check the index and columns are matched.)
import numpy as np
import pandas as pd
df = pd.DataFrame([["1","cat","mouse"],
["2","dog","elephant"],
["3","cat","giraf"],
["4",np.nan,"ant"]],columns=["Day","Cat1","Cat2"])
In: df["Cat1"].combine_first(df["Cat2"])
Out:
0 cat
1 dog
2 cat
3 ant
Name: Cat1, dtype: object
Compare with other answers:
%timeit df["Cat1"].combine_first(df["Cat2"])
181 µs ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit df['Cat1'].fillna(df['Cat2'])
253 µs ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.where(df.Cat1.isnull(), df.Cat2, df.Cat1)
88.1 µs ± 793 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I didn't use this method below:
def is_missing(Cat1,Cat2):
if np.isnan(Cat1):
return Cat2
else:
return Cat1
df['Cat1'] = df.apply(lambda x: is_missing(x['Cat1'],x['Cat2']),axis=1)
because it will raise an Exception:
TypeError: ("ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''", 'occurred at index 0')
which means np.isnan can be applied to NumPy arrays of native dtype (such as np.float64), but raises TypeError when applied to object arrays.
So I revise the method:
def is_missing(Cat1,Cat2):
if pd.isnull(Cat1):
return Cat2
else:
return Cat1
%timeit df.apply(lambda x: is_missing(x['Cat1'],x['Cat2']),axis=1)
701 µs ± 7.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
回答5:
Here is a more general approach (fillna method is probably better)
def is_missing(Cat1,Cat2):
if np.isnan(Cat1):
return Cat2
else:
return Cat1
df['Cat1'] = df.apply(lambda x: is_missing(x['Cat1'],x['Cat2']),axis=1)
来源:https://stackoverflow.com/questions/30357276/how-to-pass-another-entire-column-as-argument-to-pandas-fillna