I think it is better and faster here to use map with a dictionary, if only male and female exist in the Sex column:
df_base['Sex'] = df_base['Sex'].map(dict(zip(['male','female'],[0,1])))
This is the same as:
df_base['Sex'] = df_base['Sex'].map({'male': 0,'female': 1})
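The condition that only male and female exist matters, because map returns NaN for any value that is not a key in the dictionary. A minimal sketch on made-up sample data (the Series s and the value 'unknown' are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical sample data; 'unknown' is not a key in the mapping.
s = pd.Series(['male', 'female', 'unknown'])

mapped = s.map({'male': 0, 'female': 1})

# The unmapped value becomes NaN, so the result is a float column.
print(mapped)
print(mapped.isna().sum())  # number of values missing from the dictionary
```

Checking isna() after the mapping is a cheap way to confirm that the column really contained only the expected values.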
Another solution, if only female and male values exist, is to cast the boolean mask to integers, so True/False become 1/0:
df_base['Sex'] = (df_base['Sex'] == 'female').astype(int)
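Unlike map, the mask never produces NaN: everything that is not equal to 'female' becomes 0, including unexpected or missing values. A small sketch on made-up data (the Series s and the value 'unknown' are hypothetical) to show the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with an unexpected value and a missing value.
s = pd.Series(['male', 'female', 'unknown', np.nan])

# Only exact matches of 'female' are True; everything else is False.
encoded = (s == 'female').astype(int)
print(encoded.tolist())
```

So this approach is only safe when the column is known to hold just the two values, otherwise other values are silently encoded as 0.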
Performance:
import numpy as np
import pandas as pd
import perfplot

np.random.seed(2019)

def ma(df):
    # map with a dictionary
    df = df.copy()
    df['Sex_new'] = df['Sex'].map({'male': 0,'female': 1})
    return df

def rep1(df):
    # replace with two lists
    df = df.copy()
    df['Sex'] = df['Sex'].replace(['male','female'],[0,1])
    return df

def nwhere(df):
    # numpy.where on a boolean mask
    df = df.copy()
    df['Sex_new'] = np.where(df['Sex'] == 'male', 0, 1)
    return df

def mask1(df):
    # boolean mask on the Series, cast to integers
    df = df.copy()
    df['Sex_new'] = (df['Sex'] == 'female').astype(int)
    return df

def mask2(df):
    # boolean mask on the underlying array, cast to integers
    df = df.copy()
    df['Sex_new'] = (df['Sex'].values == 'female').astype(int)
    return df

def make_df(n):
    df = pd.DataFrame({'Sex': np.random.choice(['male','female'], size=n)})
    return df

perfplot.show(
    setup=make_df,
    kernels=[ma, rep1, nwhere, mask1, mask2],
    n_range=[2**k for k in range(2, 18)],
    logx=True,
    logy=True,
    equality_check=False,  # skip output comparison: rep1 overwrites 'Sex', the others add 'Sex_new'
    xlabel='len(df)')
Conclusion:
If only 2 values are replaced, replace is the slowest; numpy.where, map and the mask are similar. To improve performance, compare against the NumPy array with .values.
Also, everything depends on the data, so it is best to test with real data.
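For a quick check on your own data without perfplot, a rough sketch with the standard timeit module could look like this. Here df_real is only a randomly generated stand-in (an assumption for illustration) and should be replaced by the real DataFrame; the names candidates and timings are likewise made up:

```python
import timeit

import numpy as np
import pandas as pd

# Stand-in for real data (assumption); replace with your own DataFrame.
rng = np.random.default_rng(2019)
df_real = pd.DataFrame({'Sex': rng.choice(['male', 'female'], size=10_000)})

# The approaches to compare, each wrapped in a zero-argument callable.
candidates = {
    'map': lambda: df_real['Sex'].map({'male': 0, 'female': 1}),
    'mask': lambda: (df_real['Sex'] == 'female').astype(int),
    'mask on .values': lambda: (df_real['Sex'].values == 'female').astype(int),
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats, 20 calls each, reported as seconds per call.
    timings[name] = min(timeit.repeat(func, number=20, repeat=3)) / 20
    print(f'{name}: {timings[name] * 1e6:.1f} µs per call')
```

The absolute numbers depend on the machine, the pandas version and the column dtype, so only the relative order on your own data is meaningful.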