How to split data into training/testing sets using sample function

猫巷女王i 2020-11-22 10:43

I've just started using R and I'm not sure how to incorporate my dataset with the following sample code:

sample(x, size, replace = FALSE, prob = NULL)


        
24 Answers
  • 2020-11-22 10:43

    Use base R. The function runif generates uniformly distributed values between 0 and 1. Comparing these values against a cutoff (train.size in the example below) puts approximately that proportion of randomly chosen records into the training set.

    data(mtcars)
    set.seed(123)
    
    #desired proportion of records in training set
    train.size<-.7
    #true/false vector of values above/below the cutoff above
    train.ind<-runif(nrow(mtcars))<train.size
    
    #train
    train.df<-mtcars[train.ind,]
    
    
    #test
    test.df<-mtcars[!train.ind,]
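
    Because each row is assigned independently, the realised training share is only close to train.size, not guaranteed to equal it. A minimal sketch to check this on mtcars (the seed and cutoff mirror the example above):

```r
data(mtcars)
set.seed(123)

n <- nrow(mtcars)
train.size <- 0.7

# Each row is kept for training independently with probability train.size
train.ind <- runif(n) < train.size

# Realised share of training rows; close to 0.7 but usually not exactly 0.7
mean(train.ind)

# Every row lands in exactly one of the two sets
sum(train.ind) + sum(!train.ind) == n
```

    If an exact number of training rows is required, drawing a fixed number of row indices with sample() is the safer choice.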
    
  • 2020-11-22 10:45

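    This splits the data frame 'a' into train (70%) and test (30%). dplyr verbs do not reliably preserve row names, so the rows are tracked with an explicit id column rather than with rownames():

        # a is the original data frame
        library(dplyr)
        a <- mutate(a, id = row_number())       # explicit row identifier
        train <- sample_frac(a, 0.7)            # random 70% of the rows
        test <- anti_join(a, train, by = "id")  # the remaining 30%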

  • 2020-11-22 10:49

    Below is a function that creates a list of sub-samples of equal size. This is not exactly what you asked for, but it might prove useful to others. In my case, I used it to build multiple classification trees on smaller samples to test for overfitting:

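    df_split <- function(df, number) {
      sizedf <- nrow(df)
      # rows per chunk; assumes number divides nrow(df) evenly
      bound  <- sizedf / number
      chunks <- vector("list", number)
      for (i in seq_len(number)) {
        # consecutive rows (i-1)*bound+1 .. i*bound, taken in their original order
        chunks[[i]] <- df[((i - 1) * bound + 1):(i * bound), ]
      }
      chunks
    }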
    

    Example:

    x <- matrix(c(1:10), ncol=1)
    x
    # [,1]
    # [1,]    1
    # [2,]    2
    # [3,]    3
    # [4,]    4
    # [5,]    5
    # [6,]    6
    # [7,]    7
    # [8,]    8
    # [9,]    9
    #[10,]   10
    
    x.split <- df_split(x,5)
    x.split
    # [[1]]
    # [1] 1 2
    
    # [[2]]
    # [1] 3 4
    
    # [[3]]
    # [1] 5 6
    
    # [[4]]
    # [1] 7 8
    
    # [[5]]
    # [1] 9 10
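
    The function above takes consecutive rows and expects the row count to divide evenly. As a hedged variation, the sketch below shuffles the rows first and tolerates uneven sizes; df_split_random is a hypothetical name, it assumes a data frame (not a matrix) as input, and mtcars serves only as example data:

```r
df_split_random <- function(df, number) {
  # Assign each row a fold label 1..number, then shuffle the labels
  fold <- sample(rep(seq_len(number), length.out = nrow(df)))
  # split() on a data frame returns a list of data frames, one per label
  split(df, fold)
}

set.seed(1)  # arbitrary seed, only for reproducibility
parts <- df_split_random(mtcars, 5)

# Number of rows in each sub-sample; sizes differ by at most one
sapply(parts, nrow)
```

    Each row ends up in exactly one sub-sample, so the pieces can also serve as folds for cross-validation.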
    
  • 2020-11-22 10:51

    My solution is basically the same as dickoa's but a little easier to interpret:

    data(mtcars)
    n = nrow(mtcars)
    trainIndex = sample(1:n, size = round(0.7*n), replace=FALSE)
    train = mtcars[trainIndex ,]
    test = mtcars[-trainIndex ,]
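
    To relate this to the sample(x, size, replace, prob) signature in the question: x is the set of candidate row numbers and size is the number of training rows. A minimal sketch with a seed for reproducibility and two sanity checks (the seed value is arbitrary):

```r
data(mtcars)
set.seed(42)  # arbitrary seed, chosen only for illustration

n <- nrow(mtcars)

# x = candidate row numbers, size = number of training rows
trainIndex <- sample(seq_len(n), size = round(0.7 * n), replace = FALSE)

train <- mtcars[trainIndex, ]
test  <- mtcars[-trainIndex, ]

# Sanity checks: sizes add up and no row appears in both sets
stopifnot(nrow(train) + nrow(test) == n)
stopifnot(length(intersect(rownames(train), rownames(test))) == 0)
```

    Unlike a per-row random cutoff, this always yields exactly round(0.7 * n) training rows.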
    
  • 2020-11-22 10:53

    The scorecard package has a useful function for this, which lets you specify both the split ratio and the random seed:

    library(scorecard)
    
    dt_list <- split_df(mtcars, ratio = 0.75, seed = 66)
    

    The train and test sets are stored in a list and can be accessed as dt_list$train and dt_list$test.

  • 2020-11-22 10:53

    Use the caTools package. Sample code follows:

    library(caTools)

    # data is your data frame; DependentColumnName is a placeholder for the outcome column
    split <- sample.split(data$DependentColumnName, SplitRatio = 0.6)
    training_set <- subset(data, split == TRUE)
    test_set <- subset(data, split == FALSE)
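
    sample.split keeps the relative proportions of the labels in the supplied outcome vector roughly the same in both sets. To illustrate that idea without the package, here is a minimal base R sketch; stratified_split is a hypothetical helper, not part of caTools, and iris is used only as example data:

```r
stratified_split <- function(df, outcome, ratio = 0.6) {
  # Row numbers grouped by the value of the outcome column
  idx_by_class <- split(seq_len(nrow(df)), df[[outcome]])
  # Within each class, draw round(ratio * class size) rows for training.
  # Assumes at least one training row overall, so the negative index is not empty.
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(ratio * length(idx)))]
  }), use.names = FALSE)
  list(train = df[train_idx, ], test = df[-train_idx, ])
}

set.seed(1)  # arbitrary seed, only for reproducibility
parts <- stratified_split(iris, "Species", ratio = 0.6)

# Class counts in each set; every species is split in the same 60/40 ratio
table(parts$train$Species)
table(parts$test$Species)
```

    Stratifying like this matters most when some classes are rare, since a plain random split can leave too few of them in one of the sets.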
    