Is pd.get_dummies one-hot encoding?

Given the difference between one-hot encoding and dummy coding, is the pandas.get_dummies method one-hot encoding when using default parameters (i.e. drop_fir


                      
2 Answers

生来不讨喜
2021-02-01 09:39

First question: yes, pd.get_dummies() is one-hot encoding in its default state; see example below, from pd.get_dummies docs:

import pandas as pd

s = pd.Series(list('abca'))
pd.get_dummies(s, drop_first=False)


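Second question: [edited now that OP includes code example] yes, if you are one-hot encoding the inputs to a logistic regression model, it is appropriate to skip the intercept. The indicator columns sum to 1 in every row, so together they already span the constant term, and a separate intercept column would be perfectly collinear with them.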
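To make the intercept point concrete, here is a minimal sketch (not from the original answer) that compares column ranks with NumPy. It assumes NumPy is available and passes dtype=float only to make the linear algebra straightforward; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(list("abca"))

# Full one-hot encoding (the default, drop_first=False): columns a, b, c.
one_hot = pd.get_dummies(s, drop_first=False, dtype=float).to_numpy()

# Dummy coding: the first level is dropped, leaving columns b, c.
dummy = pd.get_dummies(s, drop_first=True, dtype=float).to_numpy()

# Constant column, as a model that fits an intercept would add.
const = np.ones((len(s), 1))
one_hot_plus_const = np.hstack([const, one_hot])  # 4 columns
dummy_plus_const = np.hstack([const, dummy])      # 3 columns

rank_one_hot = np.linalg.matrix_rank(one_hot)
rank_one_hot_plus_const = np.linalg.matrix_rank(one_hot_plus_const)
rank_dummy_plus_const = np.linalg.matrix_rank(dummy_plus_const)

# The intercept adds a column to the one-hot design but no new rank.
print(rank_one_hot, rank_one_hot_plus_const, rank_dummy_plus_const)
```

In this small case the one-hot design with an intercept has four columns but only rank 3, while the drop_first=True design with an intercept has three columns and rank 3. That is why one either drops the intercept (one-hot) or drops a level (dummy coding) in an unregularized model.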
                                                                        
                                                        
            
            
              
                
北海茫月
2021-02-01 09:45

Dummies are any variables that are either one or zero for each observation. When pd.get_dummies is applied to a column of categories with one category per observation, it produces a new column (variable) for each unique categorical value and places a one in the column corresponding to the value present for that observation. This is equivalent to one-hot encoding.

One-hot encoding is characterized by having exactly one 1 per observation within each set of category columns.

Consider the series s

s = pd.Series(list('AABBCCABCDDEE'))

s

0     A
1     A
2     B
3     B
4     C
5     C
6     A
7     B
8     C
9     D
10    D
11    E
12    E
dtype: object


pd.get_dummies will produce a one-hot encoding. And yes, it is absolutely appropriate not to fit the intercept.

pd.get_dummies(s)

    A  B  C  D  E
0   1  0  0  0  0
1   1  0  0  0  0
2   0  1  0  0  0
3   0  1  0  0  0
4   0  0  1  0  0
5   0  0  1  0  0
6   1  0  0  0  0
7   0  1  0  0  0
8   0  0  1  0  0
9   0  0  0  1  0
10  0  0  0  1  0
11  0  0  0  0  1
12  0  0  0  0  1
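The "exactly one 1 per row" property can be checked directly. The sketch below uses a hypothetical helper, is_one_hot, which is not part of pandas. It casts to int first because pandas 2.0 and later return boolean columns from pd.get_dummies by default, whereas the output above shows the older 0/1 display.

```python
import pandas as pd


def is_one_hot(frame: pd.DataFrame) -> bool:
    """Hypothetical helper: True if all entries are 0/1 and each row has exactly one 1."""
    values = frame.astype(int)  # handles both bool and integer dummy columns
    only_zero_one = values.isin([0, 1]).all().all()
    one_per_row = (values.sum(axis=1) == 1).all()
    return bool(only_zero_one and one_per_row)


s = pd.Series(list("AABBCCABCDDEE"))
encoded = pd.get_dummies(s)

result = is_one_hot(encoded)
print(result)
```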


However, if s held multi-label data (several categories per observation, separated by '|') and you used pd.Series.str.get_dummies instead:

s = pd.Series('A|B,A,B,B,C|D,D|B,A,B,C,A|D'.split(','))

s

0    A|B
1      A
2      B
3      B
4    C|D
5    D|B
6      A
7      B
8      C
9    A|D
dtype: object


Then str.get_dummies produces dummy variables that are not one-hot encoded, because a row can contain more than one 1, and you could in principle keep the intercept.

s.str.get_dummies()

   A  B  C  D
0  1  1  0  0
1  1  0  0  0
2  0  1  0  0
3  0  1  0  0
4  0  0  1  1
5  0  1  0  1
6  1  0  0  0
7  0  1  0  0
8  0  0  1  0
9  1  0  0  1
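As a rough check of that claim, the following sketch (an illustration, assuming NumPy is available; variable names are made up here) counts the ones in each row and tests whether a constant column is linearly independent of the indicator columns.

```python
import numpy as np
import pandas as pd

s = pd.Series("A|B,A,B,B,C|D,D|B,A,B,C,A|D".split(","))
multi_hot = s.str.get_dummies()  # splits on "|" by default

# Rows with more than one 1 are what break the one-hot property.
ones_per_row = multi_hot.astype(int).sum(axis=1).tolist()

X = multi_hot.to_numpy(dtype=float)
X_const = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column

rank_without_const = np.linalg.matrix_rank(X)
rank_with_const = np.linalg.matrix_rank(X_const)

print(ones_per_row)
print(rank_without_const, rank_with_const)
```

Here the row sums are not constant, so the intercept column is not a linear combination of the indicator columns and the rank goes from 4 to 5 when it is added. For this particular data, an intercept carries information the dummies do not.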

                                                                        
                                                        
            
            
              
                