Is pd.get_dummies one-hot encoding?

一整个雨季 2021-02-01 09:10

Given the difference between one-hot encoding and dummy coding, is the pandas.get_dummies method one-hot encoding when using default parameters (i.e. drop_fir

2 Answers
  • 2021-02-01 09:39

    First question: yes, pd.get_dummies() performs one-hot encoding with its default parameters; see the example below, from the pd.get_dummies docs:

    import pandas as pd

    s = pd.Series(list('abca'))
    pd.get_dummies(s, drop_first=False)  # one column per category; none dropped
    

    Second question: [edited now that OP includes code example] yes, if you are one-hot encoding the inputs to a logistic regression model, it is appropriate to skip the intercept.
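
The reasoning behind skipping the intercept can be checked numerically. The sketch below is only an illustration, not part of the original answer: it builds two design matrices with an explicit intercept column (the names `X_one_hot` and `X_dummy` are made up here) and compares their ranks. With full one-hot encoding the category columns add up to the intercept column, so the matrix is rank deficient; with `drop_first=True` it is not.

```python
import numpy as np
import pandas as pd

s = pd.Series(list("abca"))

# Full one-hot encoding (the default): one column per category.
one_hot = pd.get_dummies(s, drop_first=False).astype(float)

# Dummy coding: the first category is dropped and acts as the baseline.
dummy = pd.get_dummies(s, drop_first=True).astype(float)

# Prepend an explicit intercept column of ones to each design matrix.
intercept = np.ones((len(s), 1))
X_one_hot = np.hstack([intercept, one_hot.to_numpy()])
X_dummy = np.hstack([intercept, dummy.to_numpy()])

# The one-hot columns sum to the intercept column, so one column is redundant.
rank_one_hot = np.linalg.matrix_rank(X_one_hot)
rank_dummy = np.linalg.matrix_rank(X_dummy)

print(X_one_hot.shape[1], rank_one_hot)  # 4 columns, rank 3
print(X_dummy.shape[1], rank_dummy)      # 3 columns, rank 3
```

So with all category columns present, either drop the intercept or drop one category; keeping both leaves the coefficients unidentified unless the model is regularized.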

  • 2021-02-01 09:45

    Dummies are any variables that are either one or zero for each observation. When pd.get_dummies is applied to a column of categories with exactly one category per observation, it produces a new column (variable) for each unique categorical value and places a one in the column corresponding to the value present for that observation. This is equivalent to one-hot encoding.

    One-hot encoding is characterized by having exactly one 1 per set of categorical values per observation.

    Consider the series s

    s = pd.Series(list('AABBCCABCDDEE'))
    
    s
    
    0     A
    1     A
    2     B
    3     B
    4     C
    5     C
    6     A
    7     B
    8     C
    9     D
    10    D
    11    E
    12    E
    dtype: object
    

    pd.get_dummies will produce a one-hot encoding. And yes, it is absolutely appropriate not to fit the intercept.

    pd.get_dummies(s)
    
        A  B  C  D  E
    0   1  0  0  0  0
    1   1  0  0  0  0
    2   0  1  0  0  0
    3   0  1  0  0  0
    4   0  0  1  0  0
    5   0  0  1  0  0
    6   1  0  0  0  0
    7   0  1  0  0  0
    8   0  0  1  0  0
    9   0  0  0  1  0
    10  0  0  0  1  0
    11  0  0  0  0  1
    12  0  0  0  0  1
    

    However, suppose s held different data, with some observations belonging to more than one category, and you used pd.Series.str.get_dummies:

    s = pd.Series('A|B,A,B,B,C|D,D|B,A,B,C,A|D'.split(','))
    
    s
    
    0    A|B
    1      A
    2      B
    3      B
    4    C|D
    5    D|B
    6      A
    7      B
    8      C
    9    A|D
    dtype: object
    

    Then get_dummies produces dummy variables that are not one-hot encoded, and you could in theory keep the intercept.

    s.str.get_dummies()
    
       A  B  C  D
    0  1  1  0  0
    1  1  0  0  0
    2  0  1  0  0
    3  0  1  0  0
    4  0  0  1  1
    5  0  1  0  1
    6  1  0  0  0
    7  0  1  0  0
    8  0  0  1  0
    9  1  0  0  1
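
One way to tell the two cases apart is to check the row sums: a one-hot encoding has exactly one 1 per row. The following is a minimal sketch reusing the two series from this answer; the variable names (`single`, `multi`, `is_one_hot`, and so on) are illustrative only.

```python
import pandas as pd

# One category per observation.
single = pd.Series(list("AABBCCABCDDEE"))
# Some observations carry two categories, separated by "|".
multi = pd.Series("A|B,A,B,B,C|D,D|B,A,B,C,A|D".split(","))

# Cast to int so the sums work whether pandas returns 0/1 or booleans.
one_hot = pd.get_dummies(single).astype(int)
# Series.str.get_dummies splits on "|" by default.
multi_hot = multi.str.get_dummies()

one_hot_row_sums = one_hot.sum(axis=1)
multi_row_sums = multi_hot.sum(axis=1)

# A one-hot encoding has exactly one 1 in every row.
is_one_hot = bool((one_hot_row_sums == 1).all())
is_multi_one_hot = bool((multi_row_sums == 1).all())

print(is_one_hot)        # True
print(is_multi_one_hot)  # False: rows such as "A|B" sum to 2
```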
    