I have a highly imbalanced data set with target class instances in the following ratio 60000:1000:1000:50 (i.e. a total of 4 classes). I want to use randomForest for this classification problem.
classwt is correctly passed on to randomForest; check this example:
library(randomForest)
rf = randomForest(Species~., data = iris, classwt = c(1E-5,1E-5,1E5))
rf
#Call:
# randomForest(formula = Species ~ ., data = iris, classwt = c(1e-05, 1e-05, 1e+05))
#               Type of random forest: classification
#                     Number of trees: 500
#No. of variables tried at each split: 2
#
#        OOB estimate of  error rate: 66.67%
#Confusion matrix:
#           setosa versicolor virginica class.error
#setosa          0          0        50           1
#versicolor      0          0        50           1
#virginica       0          0        50           0
Class weights are the priors on the outcomes. You need to balance them to achieve the results you want.
On strata and sampsize, this answer might be of help: https://stackoverflow.com/a/20151341/2874779
In general, sampsize with the same size for all classes seems reasonable. strata is a factor that's going to be used for stratified resampling; in your case you don't need to input anything.
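To make that concrete, here is a minimal follow-up sketch (mine, not from the linked answer) that combines roughly balanced priors with equal per-class sampsize on iris; the specific weight and sample-size values are illustrative only:

library(randomForest)

rf_balanced = randomForest(
  Species ~ ., data = iris,
  classwt  = c(1, 1, 1),          # roughly equal priors instead of the extreme ones above
  strata   = iris$Species,        # stratify the per-tree samples by class
  sampsize = c(30, 30, 30)        # draw the same number of rows from each class
)
rf_balanced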
You can pass a named vector to classwt. But how the weight is calculated is very tricky.
For example, if your target variable y has two classes "Y" and "N", and you want to set balanced weights, you should do:
wn = sum(y == "N") / length(y)
wy = 1
Then set classwt = c("N"=wn, "Y"=wy)
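As a runnable sketch of that recipe (the data below are made up purely for illustration):

library(randomForest)

set.seed(1)
x = data.frame(a = rnorm(1000), b = rnorm(1000))
y = factor(ifelse(runif(1000) < 0.05, "Y", "N"))   # ~5% "Y", i.e. heavily imbalanced

wn = sum(y == "N") / length(y)    # note '==', not '='
wy = 1
rf = randomForest(x, y, classwt = c("N" = wn, "Y" = wy))
rf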
Alternatively, you may want to use the ranger package. This package offers flexible builds of random forests, and specifying class / sample weights is easy. ranger is also supported by the caret package.
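A minimal ranger sketch, using iris as stand-in data and purely illustrative weights (class.weights takes one weight per factor level, in level order; case.weights would weight individual rows instead):

library(ranger)

rf_ranger = ranger(
  Species ~ ., data = iris,
  num.trees     = 500,
  class.weights = c(1, 1, 2)      # order of factor levels: setosa, versicolor, virginica
)
rf_ranger$confusion.matrix

With caret, method = "ranger" gives access to the same implementation.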
Random forests are probably not the right classifier for your problem as they are extremely sensitive to class imbalance.
When I have an unbalanced problem, I usually deal with it using sampsize like you tried. However, I make all the strata equal size and I use sampling without replacement.
Sampling without replacement is important here, as otherwise samples from the smaller classes will contain many more repetitions, and the class will still be underrepresented. It may be necessary to increase mtry if this approach leads to small samples, sometimes even setting it to the total number of features.
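A sketch of this approach, assuming a hypothetical data frame df with a four-level factor outcome target (both names are placeholders, not from the question):

library(randomForest)

n_min = min(table(df$target))                     # size of the smallest class
rf = randomForest(
  target ~ ., data = df,
  strata   = df$target,                           # stratified per-tree sampling
  sampsize = rep(n_min, nlevels(df$target)),      # equal-sized strata
  replace  = FALSE,                               # sample without replacement
  mtry     = ncol(df) - 1                         # optionally raise mtry, up to all predictors
)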
This works quite well when there are enough items in the smallest class. However, your smallest class has only 50 items; I doubt you would get useful results with sampsize=c(50,50,50,50).
Also, classwt has never worked for me.