How to split data into training/testing sets using sample function

猫巷女王i 2020-11-22 10:43

I've just started using R and I'm not sure how to incorporate my dataset with the following sample code:

sample(x, size, replace = FALSE, prob = NULL)


        
24 answers
  • 2020-11-22 10:59

    I think this would solve the problem:

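    df <- read.csv("data.csv")    # read.csv() already returns a data frame
    # Split the dataset 80/20; rows are taken in file order, not shuffled
    numberOfRows <- nrow(df)
    bound <- as.integer(numberOfRows * 0.8)
    # Keep all columns; use df[1:bound, 2] instead to keep only column 2
    train <- df[1:bound, ]
    test1 <- df[(bound + 1):numberOfRows, ]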
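
    The split above takes rows in file order, so it does not use sample() at all. A minimal sketch of the same 80/20 split with randomly drawn row numbers follows. It assumes the built-in mtcars data frame as a stand-in for data.csv, and the seed value is arbitrary:

    ```r
    set.seed(42)                           # arbitrary seed, only for reproducibility
    df <- mtcars                           # stand-in for read.csv("data.csv")
    n_train <- floor(0.8 * nrow(df))       # number of training rows (80%)
    # Draw n_train distinct row numbers at random (replace = FALSE is the default)
    train_idx <- sample(seq_len(nrow(df)), size = n_train)
    train <- df[train_idx, ]               # sampled rows
    test <- df[-train_idx, ]               # all remaining rows
    ```

    Negative indexing with -train_idx guarantees that every row lands in exactly one of the two sets.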
    
  • 2020-11-22 11:02

    After looking through all the different methods posted here, I didn't see anyone utilize TRUE/FALSE to select and unselect data. So I thought I would share a method utilizing that technique.

    n = nrow(dataset)
    split = sample(c(TRUE, FALSE), n, replace=TRUE, prob=c(0.75, 0.25))
    
    training = dataset[split, ]
    testing = dataset[!split, ]
    

    Explanation

    There are multiple ways of selecting data in R. Most commonly, people use positive indices to select elements and negative indices to drop them. The same result can be achieved with a logical vector, where TRUE selects an element and FALSE drops it.

    Consider the following example.

    # let's explore ways to select every other element
    data = c(1, 2, 3, 4, 5)
    
    
    # using positive indices to select wanted elements
    data[c(1, 3, 5)]
    [1] 1 3 5
    
    # using negative indices to remove unwanted elements
    data[c(-2, -4)]
    [1] 1 3 5
    
    # using booleans to select wanted elements
    data[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
    [1] 1 3 5
    
    # R recycles the TRUE/FALSE vector if it is shorter than the data
    data[c(TRUE, FALSE)]
    [1] 1 3 5
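
    Because sample() with prob assigns each row independently, the 75/25 ratio above is only approximate and the set sizes change from run to run. A sketch of a variant that keeps the TRUE/FALSE idea but fixes the sizes exactly is shown here. It assumes mtcars as a stand-in for dataset, and the name in_train is made up for the example:

    ```r
    set.seed(1)                            # arbitrary seed, only for reproducibility
    dataset <- mtcars                      # stand-in data
    n <- nrow(dataset)
    n_train <- round(0.75 * n)             # exact number of training rows
    # Build a mask with exactly n_train TRUEs, then shuffle it
    in_train <- sample(rep(c(TRUE, FALSE), times = c(n_train, n - n_train)))
    training <- dataset[in_train, ]
    testing <- dataset[!in_train, ]
    ```

    Calling sample() on a vector without a size argument returns a random permutation of it, which is what shuffles the mask.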
    
  • 2020-11-22 11:04

    I would use dplyr for this; it makes the split very simple. It does require an id variable in your data set, which is a good idea anyway, not only for creating the sets but also for traceability during your project. Add one if your data does not already have it.

    library(dplyr)    # provides %>%, sample_frac() and anti_join()
    mtcars$id <- 1:nrow(mtcars)
    train <- mtcars %>% dplyr::sample_frac(.75)
    test  <- dplyr::anti_join(mtcars, train, by = 'id')
    
  • 2020-11-22 11:04

    I can suggest using the rsample package:

    library(rsample)
    # choosing 75% of the data to be the training data
    data_split <- initial_split(data, prop = .75)
    # extracting training data and test data as two separate data frames
    data_train <- training(data_split)
    data_test  <- testing(data_split)
    
  • 2020-11-22 11:05

    This is almost the same code, but tidier:

    bound <- floor((nrow(df)/4)*3)         #number of training rows (75%)
    
    df <- df[sample(nrow(df)), ]           #sample rows 
    df.train <- df[1:bound, ]              #get training set
    df.test <- df[(bound+1):nrow(df), ]    #get test set
    
  • 2020-11-22 11:07
    library(caret)
    # Partition on the same data frame that is indexed below (m_train)
    intrain <- createDataPartition(y = m_train$classe, p = 0.7, list = FALSE)
    training <- m_train[intrain, ]
    testing <- m_train[-intrain, ]
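
    When y is a factor, createDataPartition draws its sample within each class, so the class proportions are preserved in both sets. A rough base R sketch of that idea follows. It is an illustration and not caret's implementation, and it assumes the built-in iris data with Species as the outcome:

    ```r
    set.seed(123)                          # arbitrary seed, only for reproducibility
    # Row numbers grouped by class label
    idx_by_class <- split(seq_len(nrow(iris)), iris$Species)
    # Draw 70% of the row numbers within each class
    train_idx <- unlist(lapply(idx_by_class, function(idx) {
      idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
    }), use.names = FALSE)
    training <- iris[train_idx, ]
    testing <- iris[-train_idx, ]
    ```

    Sampling positions with sample.int() and then indexing idx avoids the surprise that sample(idx, ...) treats a single number as the range 1:idx.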
    