How to split data into training/testing sets using sample function

前端未结

关注

 24  1424

I\'ve just started using R and I\'m not sure how to incorporate my dataset with the following sample code:

sample(x, size, replace = FALSE, prob = NULL)


                      
              相关标签:


      
      
        
          24条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  独厮守ぢ        
                
              
                            
                2020-11-22 10:54
              
            
            
                                                                       
There is a very simple way to select a number of rows using the R index for rows and columns. This lets you CLEANLY split the data set given a number of rows - say the 1st 80% of your data.
In R all rows and columns are indexed so DataSetName[1,1] is the value assigned to the first column and first row of "DataSetName". I can select rows using [x,] and columns using [,x]
For example: If I have a data set conveniently named "data" with 100 rows I can view the first 80 rows using

View(data[1:80,])

In the same way I can select these rows and subset them using:

train = data[1:80,]
test = data[81:100,]

Now I have my data split into two parts without the possibility of resampling. Quick and easy.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  無奈伤痛        
                
              
                            
                2020-11-22 10:55
              
            
            
                                                                       
Just a more brief and simple way using awesome dplyr library:

library(dplyr)
set.seed(275) #to get repeatable data

data.train <- sample_frac(Default, 0.7)

train_index <- as.numeric(rownames(data.train))
data.test <- Default[-train_index, ]

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  闹比i        
                
              
                            
                2020-11-22 10:55
              
            
            
                                                                       
My solution shuffles the rows, then takes the first 75% of the rows as train and the last 25% as test. Super simples!

row_count <- nrow(orders_pivotted)
shuffled_rows <- sample(row_count)
train <- orders_pivotted[head(shuffled_rows,floor(row_count*0.75)),]
test <- orders_pivotted[tail(shuffled_rows,floor(row_count*0.25)),]

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  隐瞒了意图╮        
                
              
                            
                2020-11-22 10:56
              
            
            
                                                                       
If you type: 

?sample


If will launch a help menu to explain what the parameters of the sample function mean. 

I am not an expert, but here is some code I have:

data <- data.frame(matrix(rnorm(400), nrow=100))
splitdata <- split(data[1:nrow(data),],sample(rep(1:4,as.integer(nrow(data)/4))))
test <- splitdata[[1]]
train <- rbind(splitdata[[1]],splitdata[[2]],splitdata[[3]])


This will give you 75% train and 25% test.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  猫巷女王i        
                
              
                            
                2020-11-22 10:56
              
            
            
                                                                       
require(caTools)

set.seed(101)            #This is used to create same samples everytime

split1=sample.split(data$anycol,SplitRatio=2/3)

train=subset(data,split1==TRUE)

test=subset(data,split1==FALSE)


The sample.split() function will add one extra column 'split1' to dataframe and 2/3 of the rows will have this value as TRUE and others as FALSE.Now the rows where split1 is TRUE will be copied into train and other rows will be copied to test dataframe.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  执笔经年        
                
              
                            
                2020-11-22 10:57
              
            
            
                                                                       
Beware of sample for splitting if you look for reproducible results. If your data changes even slightly, the split will vary even if you use set.seed. For example, imagine the sorted list of IDs in you data is all the numbers between 1 and 10. If you just dropped one observation, say 4, sampling by location would yield a different results because now 5 to 10 all moved places. 

An alternative method is to use a hash function to map IDs into some pseudo random numbers and then sample on the mod of these numbers. This sample is more stable because assignment is now determined by the hash of each observation, and not by its relative position.

For example:

require(openssl)  # for md5
require(data.table)  # for the demo data

set.seed(1)  # this won't help `sample`

population <- as.character(1e5:(1e6-1))  # some made up ID names

N <- 1e4  # sample size

sample1 <- data.table(id = sort(sample(population, N)))  # randomly sample N ids
sample2 <- sample1[-sample(N, 1)]  # randomly drop one observation from sample1

# samples are all but identical
sample1
sample2
nrow(merge(sample1, sample2))


[1] 9999

# row splitting yields very different test sets, even though we've set the seed
test <- sample(N-1, N/2, replace = F)

test1 <- sample1[test, .(id)]
test2 <- sample2[test, .(id)]
nrow(test1)


[1] 5000

nrow(merge(test1, test2))


[1] 2653

# to fix that, we can use some hash function to sample on the last digit

md5_bit_mod <- function(x, m = 2L) {
  # Inputs: 
  #  x: a character vector of ids
  #  m: the modulo divisor (modify for split proportions other than 50:50)
  # Output: remainders from dividing the first digit of the md5 hash of x by m
  as.integer(as.hexmode(substr(openssl::md5(x), 1, 1)) %% m)
}

# hash splitting preserves the similarity, because the assignment of test/train 
# is determined by the hash of each obs., and not by its relative location in the data
# which may change 
test1a <- sample1[md5_bit_mod(id) == 0L, .(id)]
test2a <- sample2[md5_bit_mod(id) == 0L, .(id)]
nrow(merge(test1a, test2a))


[1] 5057

nrow(test1a)


[1] 5057

sample size is not exactly 5000 because assignment is probabilistic, but it shouldn't be a problem in large samples thanks to the law of large numbers.

See also: http://blog.richardweiss.org/2016/12/25/hash-splits.html
and https://crypto.stackexchange.com/questions/20742/statistical-properties-of-hash-functions-when-calculating-modulo
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
   
          
     上一页
1
2
3
4
下一页
           
           
        
                                  
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复