Specify list of possible values for Pandas get_dummies

Asked by 半阙折子戏 on 2021-02-14 17:36

Suppose I have a Pandas DataFrame like the below and I'm encoding categorical_1 for training in scikit-learn:

    data = {'numeric_1':[12.1, 3.2, 5.5, 6.8, 9.9],
            'categorical_1':['A', 'B', 'C', 'B', 'B']}
    frame = pd.DataFrame(data)
    dummies = pd.get_dummies(frame['categorical_1'])

How can I specify the full list of possible values, so that dummy columns are created even for values (say 'D' or 'E') that do not occur in the training data?
4 Answers
  • 2021-02-14 17:55

    Isn't this a better answer?

    data = pd.DataFrame({
        "values": [1, 2, 3, 4, 5, 6, 7],
        "categories": ["A", "A", "B", "B", "C", "C", "D"]
    })
    
    possibilities = ["A", "B", "C", "D", "E", "F"]
    
    # Possible values that never occur in the data
    exists = data["categories"].tolist()
    difference = pd.Series([item for item in possibilities if item not in exists])
    
    # Append the missing values so get_dummies creates columns for them
    # (Series.append was removed in pandas 2.0; use pd.concat instead)
    target = pd.concat([data["categories"], difference]).reset_index(drop=True)
    
    dummies = pd.get_dummies(target)
    
    # Drop the padding rows that existed only to force the extra columns
    dummies = dummies.iloc[:len(dummies) - len(difference)]
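    An alternative that avoids appending and then dropping rows (a sketch I am adding, not part of the original answer): cast the column to a pandas Categorical that declares every allowed value up front, and get_dummies will emit a column for each category whether or not it occurs in the data.

```python
import pandas as pd

data = pd.DataFrame({
    "values": [1, 2, 3, 4, 5, 6, 7],
    "categories": ["A", "A", "B", "B", "C", "C", "D"],
})

possibilities = ["A", "B", "C", "D", "E", "F"]

# Declare the full set of allowed values; unseen categories
# ("E", "F") still get their own dummy columns, all zeros.
cat = pd.Categorical(data["categories"], categories=possibilities)
dummies = pd.get_dummies(cat)

print(list(dummies.columns))  # ['A', 'B', 'C', 'D', 'E', 'F']
```

    No padding rows are created, so there is nothing to drop afterwards.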
    
  • 2021-02-14 18:06

    I encountered the same problem: how to unify the dummy columns between training data and testing data when using get_dummies() in Pandas. I found a solution while exploring the House Price competition on Kaggle: process the training and testing data at the same time. Suppose you have two dataframes df_train and df_test (neither containing the target column).

    all_data = pd.concat([df_train, df_test], axis=0)
    all_data = pd.get_dummies(all_data)
    X_train  = all_data.iloc[:df_train.shape[0]]   # processed training rows
    X_test   = all_data.iloc[-df_test.shape[0]:]   # processed testing rows
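    A runnable sketch of this concat trick with toy frames (df_train and df_test below are illustrative, not from the original post):

```python
import pandas as pd

df_train = pd.DataFrame({"color": ["red", "blue", "red"]})
df_test  = pd.DataFrame({"color": ["green", "blue"]})  # "green" unseen in training

# Encode train and test together so both get the same dummy columns.
all_data = pd.get_dummies(pd.concat([df_train, df_test], axis=0))

# Positional slicing keeps row order: the first len(df_train) rows are train.
X_train = all_data.iloc[:df_train.shape[0]]
X_test  = all_data.iloc[-df_test.shape[0]:]

print(list(X_train.columns))  # ['color_blue', 'color_green', 'color_red']
```

    Both frames end up with identical columns; the caveat is that you must re-run the whole encoding whenever new test data arrives.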
    

    Hope it helps.

  • 2021-02-14 18:06

    To handle the mismatch between the sets of categorical values in the train and test sets, I used:

        # Build the test dummies using the training columns; any column
        # missing from the test data is filled with zeros.
        length = test_categorical_data.shape[0]  # row count must match the *test* data
        empty_col = np.zeros(length)             # 1-D: one zero per test row
        test_categorical_data_processed = pd.DataFrame()
        for col in train_categorical_data.columns:
            test_categorical_data_processed[col] = test_categorical_data.get(col, empty_col)
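    A more compact variant of the same column-alignment idea (my addition, not from this answer) is DataFrame.reindex, which adds the missing columns filled with 0 and drops the extras in one call:

```python
import pandas as pd

train = pd.DataFrame({"cat": ["A", "B", "C"]})
test  = pd.DataFrame({"cat": ["B", "D"]})  # "D" never seen in training

train_dummies = pd.get_dummies(train["cat"])
test_dummies = pd.get_dummies(test["cat"])

# Align the test columns to the training layout: columns missing from
# the test data become all zeros; columns unseen in training ("D") are dropped.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_aligned.columns))  # ['A', 'B', 'C']
```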
    
  • 2021-02-14 18:11

    First, if you want pandas to account for more values, simply add them to the list sent to the get_dummies method:

    data = {'numeric_1':[12.1, 3.2, 5.5, 6.8, 9.9], 
            'categorical_1':['A', 'B', 'C', 'B', 'B']}
    frame = pd.DataFrame(data)
    dummy_values = pd.get_dummies(data['categorical_1'] + ['D','E'])
    

    since in Python, + on lists performs concatenation, so

    ['A','B','C','B','B'] + ['D','E']
    

    results in

    ['A', 'B', 'C', 'B', 'B', 'D', 'E']
    

    In my mind this is necessary to account for test data with a value for that column outside of the values used in the training set, but being a novice in machine learning, perhaps that is not necessary so I'm open to a different way to approach this.

    From the machine learning perspective, it is quite redundant. The column is categorical, so the value 'D' means nothing to a model that has never seen it before. If you are encoding the features as one-hot ("unary") vectors, which I assume since you create a column for each value, it is enough to represent the 'D' and 'E' values with

    A   B   C
    0   0   0
    

    (I assume that you represent the 'B' value with 0 1 0, 'C' with 0 0 1, etc.)

    because if no such values appeared in the training set, then at test time no model will distinguish between receiving the value 'D' or 'Elephant'.
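    To illustrate the point (a sketch using pd.Categorical, which this answer itself does not use): any value outside the declared training categories encodes to an all-zero row, so 'D' and 'Elephant' look identical to the model.

```python
import pandas as pd

train_values = ["A", "B", "C"]

# Encode test values against only the training categories; anything
# unseen ("D", "Elephant") maps to NaN and thus to an all-zero dummy row.
test = pd.Categorical(["B", "D", "Elephant"], categories=train_values)
dummies = pd.get_dummies(test)

print(dummies.sum(axis=1).tolist())  # [1, 0, 0]
```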

    The only reason for such an action would be the assumption that in the future you will add data containing 'D' values and simply do not want to modify the code; then it is reasonable to do it now, even though it makes training slightly more complex (you add a dimension that, for now, carries no information), but that seems a small problem.

    If you are not going to use a one-hot encoding, but rather want to keep the column as a single categorical feature, then you do not need to create these "dummies" at all; instead use a model that can work with such values directly, such as Naive Bayes, which can be trained with Laplace smoothing to handle values absent from the training data.
