Creating a data partition using caret and data.table


I have a data.table in R that I want to use with the caret package.

set.seed(42)
trainingRows<-createDataPartition(DT$variable, p=0.75, list=FALSE)
head(train


                      
2 answers

旧巷少年郎 (2021-01-12 22:37)

Roll your own:

inTrain <- sample(MyDT[, .I], floor(MyDT[, .N] * .75))
Train <- MyDT[inTrain]
Test <- MyDT[-inTrain]


Or, with the caret function, you can just wrap trainingRows in c().

 trainingRows<-createDataPartition(DT$variable, p=0.75, list=FALSE)
 Train <- DT[c(trainingRows)]
 Test <- DT[c(-trainingRows)]


===

Edit by Matt

Was going to add a comment, but too long.

I've seen sample(.I,...) being used quite a bit recently.  This is inefficient because it has to create the (potentially very long) .I vector which is just 1:nrow(DT). This is such a common case that R's sample() doesn't need you to pass that vector. Just pass the length. sample(nrow(DT)) already returns exactly the same result without having to create .I.  See ?sample.
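A quick base-R check of that equivalence, as a small sketch (the size `n` and the seed are arbitrary values picked for illustration, not from the original post). With the same seed, passing only the length and passing the full index vector should draw the same permutation:

```r
n <- 10L  # illustrative size (assumption)

set.seed(42)
from_length <- sample(n)            # pass only the length

set.seed(42)
from_vector <- sample(seq_len(n))   # build 1:n first, then sample from it

# Compare the two draws; both are integer vectors of length n
identical(from_length, from_vector)
```

The second form allocates the whole index vector before sampling, which is the cost the paragraph above is pointing at.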

Also, it's better to avoid variable name repetition wherever possible. More background here.

So instead of:

inTrain <- sample(MyDT[, .I], floor(MyDT[, .N] * .75))


I'd do:

inTrain <- MyDT[,sample(.N, floor(.N*.75))]
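Putting that together, a minimal end-to-end sketch might look like this. The `MyDT` table is made-up toy data (its columns `id` and `variable` are assumptions for illustration), and the data.table package is assumed to be installed:

```r
library(data.table)

set.seed(42)

# Toy data, purely illustrative
MyDT <- data.table(id = 1:100, variable = rnorm(100))

# Sample 75% of the row numbers inside j, so MyDT is named only once
inTrain <- MyDT[, sample(.N, floor(.N * 0.75))]

Train <- MyDT[inTrain]    # rows selected for training
Test  <- MyDT[-inTrain]   # negative indices keep the remaining rows

nrow(Train)
nrow(Test)
```

Note that this is a plain random split. Unlike createDataPartition, it does not try to balance the distribution of the outcome between the two sets.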

                                                                        
                                                        
            
            
              
                
被撕碎了的回忆 (2021-01-12 22:51)

The reason is that createDataPartition, when called with list=FALSE, returns an integer matrix with a single column, so its second dimension can be dropped without losing any information.

You can simply drop that dimension of trainingRows as follows:



DT[trainingRows[,1]]


The c() function from Bruce Pucci's answer will drop the dimension too.

This minor difference from data.frame was spotted a long time ago, and I recently submitted PR #1275 to fill that gap.
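
To make the dimension issue concrete, here is a small base-R sketch. The matrix below is hand-built as a stand-in for what createDataPartition(..., list=FALSE) returns; the row numbers and the column name "Resample1" are assumptions for illustration, and caret itself is not called:

```r
# Hand-built one-column integer matrix (stand-in, not real caret output)
trainingRows <- matrix(c(1L, 3L, 4L, 7L), ncol = 1,
                       dimnames = list(NULL, "Resample1"))

dim(trainingRows)          # two dimensions: rows x 1 column

by_column <- trainingRows[, 1]   # selecting the single column drops the dim
by_c      <- c(trainingRows)     # c() strips the dim attribute as well

is.null(dim(by_column))
is.null(dim(by_c))
identical(by_column, by_c)
```

Either form yields a plain integer vector, which is what DT[...] is given in the examples above.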
                                                                        
                                                        
            
            
              
                