Pandas - Convert a categorical column to binary encoded form

面向向阳花 2021-02-09 22:10

I have a dataset that looks like so -

     yyyy      month        tmax         tmin
0    1908    January         5.0         -1.4
1    1908   February         7         


        
2 Answers
  •  死守一世寂寞
    2021-02-09 22:46

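    I think you need get_dummies (on pandas 2.0+, pass dtype=int to get the 0/1 output shown below instead of True/False). Assign the result to a new name so the original df, which the later snippets still use, is not overwritten:

    dummies = pd.get_dummies(df['month'])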
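
As a rough, self-contained illustration of the call above: the small DataFrame below is a hypothetical sample built from the first rows shown in this answer, and `dtype=int` is an assumption added so that recent pandas versions print 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of the data in this thread.
df = pd.DataFrame({
    "yyyy": [1908, 1908, 1908],
    "month": ["January", "February", "March"],
    "tmax": [5.0, 7.3, 6.2],
    "tmin": [-1.4, 1.9, 0.3],
})

# dtype=int forces 0/1 columns; pandas 2.0+ would otherwise return booleans.
dummies = pd.get_dummies(df["month"], dtype=int)

# The indicator columns come back sorted alphabetically, not in calendar order.
print(dummies.columns.tolist())

# Every row has exactly one indicator set.
print(dummies.sum(axis=1).tolist())
```

Note that the result is kept in its own variable, so `df` still holds the original columns.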
    

    To add the new columns to the original DataFrame and drop the month column, use join with pop:

    df2 = df.join(pd.get_dummies(df.pop('month')))
    print (df2.head())
       yyyy  tmax  tmin  April  August  December  February  January  July  June  \
    0  1908   5.0  -1.4      0       0         0         0        1     0     0   
    1  1908   7.3   1.9      0       0         0         1        0     0     0   
    2  1908   6.2   0.3      0       0         0         0        0     0     0   
    3  1908   7.4   2.1      1       0         0         0        0     0     0   
    4  1908  16.5   7.7      0       0         0         0        0     0     0   
    
       March  May  November  October  September  
    0      0    0         0        0          0  
    1      0    0         0        0          0  
    2      1    0         0        0          0  
    3      0    0         0        0          0  
    4      0    1         0        0          0  
    

    If you do not need to remove the month column:

    df2 = df.join(pd.get_dummies(df['month']))
    print (df2.head())
       yyyy     month  tmax  tmin  April  August  December  February  January  \
    0  1908   January   5.0  -1.4      0       0         0         0        1   
    1  1908  February   7.3   1.9      0       0         0         1        0   
    2  1908     March   6.2   0.3      0       0         0         0        0   
    3  1908     April   7.4   2.1      1       0         0         0        0   
    4  1908       May  16.5   7.7      0       0         0         0        0   
    
       July  June  March  May  November  October  September  
    0     0     0      0    0         0        0          0  
    1     0     0      0    0         0        0          0  
    2     0     0      1    0         0        0          0  
    3     0     0      0    0         0        0          0  
    4     0     0      0    1         0        0          0  
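
A possible alternative to join plus pop, sketched here under the assumption that prefixed column names are acceptable, is to pass the whole DataFrame to get_dummies with `columns=`. The sample data is hypothetical, and `dtype=int` is again an added assumption for 0/1 output.

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of the data in this thread.
df = pd.DataFrame({
    "yyyy": [1908, 1908, 1908],
    "month": ["January", "February", "March"],
    "tmax": [5.0, 7.3, 6.2],
    "tmin": [-1.4, 1.9, 0.3],
})

# Encode only the 'month' column; the other columns pass through unchanged.
# The new columns are named '<column>_<value>' by default.
encoded = pd.get_dummies(df, columns=["month"], dtype=int)

print(encoded.columns.tolist())
```

Unlike `df.pop('month')`, this leaves `df` itself unmodified, which may matter if the original column is needed later.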
    

    If you need the columns in calendar order, there are several possible solutions. One is reindex, either with axis=1 or with columns= (reindex_axis was deprecated in pandas 0.21 and removed in 1.0):

    months = ['January', 'February', 'March', 'April', 'May', 'June',
              'July', 'August', 'September', 'October', 'November', 'December']

    df1 = pd.get_dummies(df['month']).reindex(months, axis=1)
    print (df1.head())
       January  February  March  April  May  June  July  August  September  \
    0        1         0      0      0    0     0     0       0          0   
    1        0         1      0      0    0     0     0       0          0   
    2        0         0      1      0    0     0     0       0          0   
    3        0         0      0      1    0     0     0       0          0   
    4        0         0      0      0    1     0     0       0          0   
    
       October  November  December  
    0        0         0         0  
    1        0         0         0  
    2        0         0         0  
    3        0         0         0  
    4        0         0         0  
    
    df1 = pd.get_dummies(df['month']).reindex(columns=months)
    print (df1.head())
       January  February  March  April  May  June  July  August  September  \
    0        1         0      0      0    0     0     0       0          0   
    1        0         1      0      0    0     0     0       0          0   
    2        0         0      1      0    0     0     0       0          0   
    3        0         0      0      1    0     0     0       0          0   
    4        0         0      0      0    1     0     0       0          0   
    
       October  November  December  
    0        0         0         0  
    1        0         0         0  
    2        0         0         0  
    3        0         0         0  
    4        0         0         0  
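
One caveat with reindex: if some months never occur in the data, the added columns are filled with NaN by default. A minimal sketch of guarding against that with `fill_value=0`, using a hypothetical three-row month column:

```python
import pandas as pd

months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# Hypothetical data in which only two of the twelve months appear.
month_col = pd.Series(["March", "January", "March"], name="month")

# reindex adds the ten missing months; fill_value=0 keeps them as integer
# zeros instead of NaN.
dummies = pd.get_dummies(month_col, dtype=int).reindex(columns=months, fill_value=0)

print(dummies.shape)
print(dummies["March"].tolist())
```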
    

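    Or convert the month column to an ordered categorical (astype no longer accepts categories= and ordered= directly, so pass a CategoricalDtype):

    df1 = pd.get_dummies(df['month'].astype(pd.CategoricalDtype(categories=months, ordered=True)))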
    print (df1.head())
       January  February  March  April  May  June  July  August  September  \
    0        1         0      0      0    0     0     0       0          0   
    1        0         1      0      0    0     0     0       0          0   
    2        0         0      1      0    0     0     0       0          0   
    3        0         0      0      1    0     0     0       0          0   
    4        0         0      0      0    1     0     0       0          0   
    
       October  November  December  
    0        0         0         0  
    1        0         0         0  
    2        0         0         0  
    3        0         0         0  
    4        0         0         0  
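
The categorical approach can be sketched end to end as follows. The three-row month column is hypothetical; the point of the sketch is that get_dummies emits one column per declared category, in category order, even for months that never appear.

```python
import pandas as pd

months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# An ordered categorical dtype that fixes both the set and the order of months.
month_dtype = pd.CategoricalDtype(categories=months, ordered=True)

# Hypothetical data in which only two of the twelve months appear.
month_col = pd.Series(["March", "January", "March"], name="month").astype(month_dtype)

# One indicator column per category, including the unobserved ones.
dummies = pd.get_dummies(month_col, dtype=int)

print(dummies.columns.tolist())
```

Compared with reindex, this keeps the ordering information on the column itself, so sorting or grouping by month also follows calendar order.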
    
