I've just started using R and I'm not sure how to incorporate my dataset with the following sample code:
sample(x, size, replace = FALSE, prob = NULL)
Use base R. The function runif generates uniformly distributed values between 0 and 1. Each row gets one such value, and rows whose value falls below the cutoff (train.size in the example below) go to the training set. For any cutoff you choose, approximately that proportion of randomly selected records ends up below it.
data(mtcars)
set.seed(123)
#desired proportion of records in training set
train.size<-.7
#TRUE/FALSE vector: TRUE where the random value is below the cutoff
train.ind<-runif(nrow(mtcars))<train.size
#train
train.df<-mtcars[train.ind,]
#test
test.df<-mtcars[!train.ind,]
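Because each row is assigned independently, the resulting split is only approximately 70/30. As a quick sanity check, the sketch below repeats the same steps on mtcars and looks at the realised proportion, and confirms that the two sets are disjoint and together cover every row:

```r
data(mtcars)
set.seed(123)
train.size <- 0.7
train.ind <- runif(nrow(mtcars)) < train.size
train.df <- mtcars[train.ind, ]
test.df  <- mtcars[!train.ind, ]

# Realised share of rows in the training set; close to 0.7 but rarely exact
mean(train.ind)

# Every row lands in exactly one of the two sets
nrow(train.df) + nrow(test.df) == nrow(mtcars)
length(intersect(rownames(train.df), rownames(test.df))) == 0
```

If you need an exact number of training rows, drawing row indices with sample() is the more direct route.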
I will split 'a' into train (70%) and test (30%):
a # original data frame
library(dplyr)
a$id <- seq_len(nrow(a)) # explicit row id, so the split does not depend on row names
train <- sample_frac(a, 0.7)
test <- anti_join(a, train, by = "id") # rows of a whose id is not in train
done
Below is a function that creates a list of sub-samples of the same size. This is not exactly what you wanted, but it might prove useful for others. In my case I used it to build multiple classification trees on smaller samples to test for overfitting:
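df_split <- function(df, number) {
  # rows per sub-sample; leftover rows are dropped if nrow(df) is not a multiple of number
  bound <- nrow(df) %/% number
  parts <- vector("list", number) # avoid shadowing base::list
  for (i in seq_len(number)) {
    parts[[i]] <- df[((i - 1) * bound + 1):(i * bound), ]
  }
  parts
}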
Example:
x <- matrix(c(1:10), ncol=1)
x
# [,1]
# [1,] 1
# [2,] 2
# [3,] 3
# [4,] 4
# [5,] 5
# [6,] 6
# [7,] 7
# [8,] 8
# [9,] 9
#[10,] 10
x.split <- df_split(x,5)
x.split
# [[1]]
# [1] 1 2
# [[2]]
# [1] 3 4
# [[3]]
# [1] 5 6
# [[4]]
# [1] 7 8
# [[5]]
# [1] 9 10
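For comparison, similar chunking can be sketched with base R's split(), which takes a grouping vector instead of an explicit loop. This is an alternative to the function above, not a rewrite of it: the toy data frame and the names number, group and chunks are made up for illustration, and unlike the loop version it keeps leftover rows by making some chunks one row larger.

```r
# Toy data frame, for illustration only
x <- data.frame(id = 1:10, value = letters[1:10])
number <- 5

# One group label per row: 1,1,2,2,...; sizes differ by at most one row
group <- sort(rep(seq_len(number), length.out = nrow(x)))

# split() returns a named list of data frames, one per group
chunks <- split(x, group)

length(chunks)        # 5
sapply(chunks, nrow)  # 2 rows in each chunk
```

To get random rather than contiguous sub-samples, shuffle the rows first, for example with x[sample(nrow(x)), ].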
My solution is basically the same as dickoa's but a little easier to interpret:
data(mtcars)
n = nrow(mtcars)
trainIndex = sample(1:n, size = round(0.7*n), replace=FALSE)
train = mtcars[trainIndex ,]
test = mtcars[-trainIndex ,]
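To apply the same idea to your own data frame, the steps can be wrapped in a small helper. This is only a sketch: the function name split_train_test and its arguments train_frac and seed are hypothetical, not part of base R or any package.

```r
# Hypothetical helper: split a data frame into train/test sets by row index
split_train_test <- function(df, train_frac = 0.7, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # optional, for reproducibility
  n <- nrow(df)
  train_idx <- sample(seq_len(n), size = round(train_frac * n), replace = FALSE)
  # setdiff() stays correct even when train_idx is empty
  test_idx <- setdiff(seq_len(n), train_idx)
  list(train = df[train_idx, , drop = FALSE],
       test  = df[test_idx, , drop = FALSE])
}

parts <- split_train_test(mtcars, train_frac = 0.7, seed = 42)
nrow(parts$train)  # 22, i.e. round(0.7 * 32)
nrow(parts$test)   # 10
```

Replace mtcars with your own data frame; parts$train and parts$test then hold the two subsets.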
The scorecard package has a useful function for that, where you can specify the ratio and seed:
library(scorecard)
dt_list <- split_df(mtcars, ratio = 0.75, seed = 66)
The test and train data are stored in a list and can be accessed by calling dt_list$train and dt_list$test.
Use the caTools package. Sample code is as follows:
library(caTools)
data # your data frame
split = sample.split(data$DependentColumnName, SplitRatio = 0.6)
training_set = subset(data, split == TRUE)
test_set = subset(data, split == FALSE)
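sample.split splits on the dependent column so that the relative class proportions are preserved in both sets. The sketch below illustrates that idea in base R on the built-in iris data; it is not how caTools implements it, and the choice of iris and the variable train_idx are assumptions made for the example.

```r
data(iris)
set.seed(1)

# For each species, pick 60% of that species' row indices for training
train_idx <- unlist(lapply(
  split(seq_len(nrow(iris)), iris$Species),
  function(idx) sample(idx, size = round(0.6 * length(idx)))
))

training_set <- iris[train_idx, ]
test_set     <- iris[-train_idx, ]

table(training_set$Species)  # 30 of each species
table(test_set$Species)      # 20 of each species
```

Each species has 50 rows in iris, so a 0.6 ratio gives 30 training and 20 test rows per class whichever rows are drawn.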