Mapping values in place (for example with Gender) from string to int in Pandas dataframe [duplicate]

3 Answers

后悔当初

2021-01-23 15:23
My instinct was to use .map(), but I compared your solution against .map() on a dataframe with 1500 random male/female values.
```
%timeit df_base['Sex_new'] = df_base['Sex'].map({'male': 0,'female': 1})
1000 loops, best of 3: 653 µs per loop
```
Edited based on coldspeed's comment, and because assigning the result to a new column makes for a fairer comparison with the others:
```
%timeit df_base['Sex_new'] = df_base['Sex'].replace(['male','female'],[0,1])
1000 loops, best of 3: 968 µs per loop
```
So .replace() is actually slower than .map()!

~~So based on this example, your 'shoddy' solution seems faster than .map()...~~

Edit

pygo's solution:
```
%timeit df_base['Sex_new'] = np.where(df_base['Sex'] == 'male', 0, 1)
1000 loops, best of 3: 331 µs per loop
```
So np.where is faster than both!

Jezrael's solution with .astype(int):
```
%timeit df_base['Sex_new'] = (df_base['Sex'] == 'female').astype(int)
1000 loops, best of 3: 388 µs per loop
```
So also faster than .map() and .replace().
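
For anyone who wants to rerun the comparison outside IPython, here is a minimal sketch using the standard-library `timeit` module. The way `df_base` is built (a seeded NumPy generator, 1500 rows) is an assumption based on the description above, and the timings it prints will differ by machine and pandas version, so treat them as relative only.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: 1500 random 'male'/'female' values, as described in the answer.
rng = np.random.default_rng(0)
df_base = pd.DataFrame({'Sex': rng.choice(['male', 'female'], size=1500)})

# The four approaches compared in this answer, wrapped as callables.
candidates = {
    'map': lambda: df_base['Sex'].map({'male': 0, 'female': 1}),
    'replace': lambda: df_base['Sex'].replace(['male', 'female'], [0, 1]),
    'np.where': lambda: np.where(df_base['Sex'] == 'male', 0, 1),
    'astype': lambda: (df_base['Sex'] == 'female').astype(int),
}

# Collect each result as a plain integer array so they can be compared.
results = {name: np.asarray(func(), dtype=int) for name, func in candidates.items()}

# Report the average time per call for each approach.
for name, func in candidates.items():
    seconds = timeit.timeit(func, number=200) / 200
    print(f'{name:>8}: {seconds * 1e6:.0f} us per call')
```

Depending on the pandas version, the `.replace()` call may emit a deprecation warning about downcasting; it does not affect the comparison.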

梦谈多话

2021-01-23 15:37

I think it is better and faster here to use map with a dictionary, provided only male and female exist in column Sex:

df_base['Sex'] = df_base['Sex'].map(dict(zip(['male','female'],[0,1])))

This is the same as:

df_base['Sex'] = df_base['Sex'].map({'male': 0,'female': 1})

If only female and male values exist, another solution is to build a boolean mask and cast it to integers, so that True/False become 1/0:

df_base['Sex'] = (df_base['Sex'] == 'female').astype(int)
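
The "only male and female" condition matters because the two approaches treat anything else differently. A small sketch, using a hypothetical Series that also contains an unexpected label and a missing value:

```python
import pandas as pd

# Hypothetical data: 'other' and None are not covered by the mapping.
s = pd.Series(['male', 'female', 'other', None])

# .map() leaves unmapped entries as NaN, so they remain detectable.
mapped = s.map({'male': 0, 'female': 1})

# The boolean mask silently turns everything that is not 'female' into 0.
masked = (s == 'female').astype(int)

print(mapped.tolist())
print(masked.tolist())
```

With `.map()` the unexpected entries can still be found afterwards via `mapped.isna()`, at the cost of the column typically no longer being an integer column; with the mask they become indistinguishable from 'male'.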

Performance:

import numpy as np
import pandas as pd
import perfplot

np.random.seed(2019)

def ma(df):
    df = df.copy()
    df['Sex_new'] = df['Sex'].map({'male': 0,'female': 1})
    return df

def rep1(df):
    df = df.copy()
    df['Sex_new'] = df['Sex'].replace(['male','female'],[0,1])
    return df

def nwhere(df):
    df = df.copy()
    df['Sex_new'] = np.where(df['Sex'] == 'male', 0, 1)
    return df

def mask1(df):
    df = df.copy()
    df['Sex_new'] = (df['Sex'] == 'female').astype(int)
    return df

def mask2(df):
    df = df.copy()
    df['Sex_new'] = (df['Sex'].values == 'female').astype(int)
    return df


def make_df(n):
    df = pd.DataFrame({'Sex': np.random.choice(['male','female'], size=n)})

    return df

perfplot.show(
    setup=make_df,
    kernels=[ma,  rep1, nwhere, mask1, mask2],
    n_range=[2**k for k in range(2, 18)],
    logx=True,
    logy=True,
    equality_check=False,  # skip comparing the kernels' outputs
    xlabel='len(df)')

Conclusion:

When only 2 values are replaced, replace is the slowest; numpy.where, map and the boolean mask perform similarly. To improve performance further, compare against the underlying NumPy array via .values.
It all also depends on the data, so it is best to test with your real data.
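
As a quick check that the `.values` variant is a drop-in replacement, the sketch below (with an assumed frame of 1000 random rows) builds the mask both ways and confirms the resulting 0/1 values agree. The only intended difference is that the second form compares on the underlying array instead of going through the Series.

```python
import numpy as np
import pandas as pd

np.random.seed(2019)
# Assumption: a small random frame, built like make_df above.
df = pd.DataFrame({'Sex': np.random.choice(['male', 'female'], size=1000)})

# Comparison on the Series: the result is a pandas Series with an index.
via_series = (df['Sex'] == 'female').astype(int)

# Comparison on the underlying array: no index is carried along.
via_array = (df['Sex'].values == 'female').astype(int)

# Both forms should encode the column identically.
same = np.array_equal(np.asarray(via_series), np.asarray(via_array))
print(same)
```

Either result can be assigned back with `df['Sex_new'] = ...`; pandas aligns the array positionally.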


傲寒

2021-01-23 15:44

Another option is np.where:

Just an example DataFrame:

>>> df
      Sex
0    male
1  female
2  female
3  female
4    male

Based on the condition, create a new column new_Sex:

>>> df['new_Sex'] = np.where(df['Sex'] == 'male', 0, 1)
>>> df
      Sex  new_Sex
0    male        0
1  female        1
2  female        1
3  female        1
4    male        0

OR:

>>> df['new_Sex'] = np.where(df['Sex'] != 'male', 1, 0)
>>> df
      Sex  new_Sex
0    male        0
1  female        1
2  female        1
3  female        1
4    male        0
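
The session above can be reproduced as a standalone script. This is only a sketch that rebuilds the same five-row example frame and checks that both `np.where` forms give the same encoding; the variable `alt` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the example DataFrame shown above.
df = pd.DataFrame({'Sex': ['male', 'female', 'female', 'female', 'male']})

# First form: 0 where the value is 'male', 1 otherwise.
df['new_Sex'] = np.where(df['Sex'] == 'male', 0, 1)

# Second form: the inverted condition with the outputs swapped.
alt = np.where(df['Sex'] != 'male', 1, 0)

print(df)
print((df['new_Sex'].to_numpy() == alt).all())
```

Note that in both forms every value other than 'male' ends up as 1, so this is only safe when the column holds exactly the two expected labels.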
