Randomly split data by criterion into training and testing data set using R

灰色年华 2020-12-20 02:45

Gidday,

I'm looking for a way to randomly split a data frame (e.g. a 90/10 split) into training and testing sets for a model, while respecting a certain grouping criterion.

Im

2 Answers
  • 2020-12-20 03:11

    Assuming no conditions on what groups you want, the following will randomly split your data frame into 90% and 10% partitions (stored in a list):

    set.seed(1)
    split(test, sample(1:nrow(test) > round(nrow(test) * .1)))
    

    Produces:

    $`FALSE`
       companycode year  expenses
    10          C3    6  760.4874
    12          C4    1 4565.7831
    
    $`TRUE`
       companycode year    expenses
    1           C1    1     8.47720
    2           C1    2     8.45250
    3           C1    3     8.46280
    4           C2    1 14828.90603
    5           C3    1   665.21565
    6           C3    2   290.66596
    7           C3    3   865.56265
    8           C3    4  6785.03586
    9           C3    5   312.02617
    11          C3    7  1155.76758
    13          C4    2  3340.36540
    14          C4    3  2656.73030
    15          C4    4  1079.46098
    16          C5    1    60.57039
    17          C6    1  6282.48118
    18          C6    2  7419.32720
    19          C7    1   644.90571
    20          C8    1 58332.34945
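
The same row-level split can also be written with an explicit index, which gives named training and testing frames instead of a list keyed by `TRUE`/`FALSE`. This is only a minimal sketch: the `test` data frame is rebuilt here as a toy stand-in (company codes and years follow the output above, the `expenses` column is a placeholder), and `n_test`, `test_idx`, `df_train` and `df_test` are illustrative names, not part of the original answer.

```r
# Toy stand-in for the `test` data frame: company codes and years follow the
# printed output above; the expenses column is a placeholder, not real data.
test <- data.frame(
  companycode = rep(paste0("C", 1:8), times = c(3, 1, 7, 4, 1, 2, 1, 1)),
  year        = c(1:3, 1, 1:7, 1:4, 1, 1:2, 1, 1),
  expenses    = seq_len(20)
)

set.seed(1)
n_test   <- round(nrow(test) * 0.1)     # 10% of the rows, here 2
test_idx <- sample(nrow(test), n_test)  # row numbers held out for testing

df_test  <- test[test_idx, ]            # 10% partition
df_train <- test[-test_idx, ]           # remaining 90%
```

Like the one-liner above, this ignores the grouping entirely, so rows from the same company can land in both partitions.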
    
  • 2020-12-20 03:26
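    # unique() works whether companycode is a factor or a character vector
    comps <- unique(as.character(df$companycode))
    
    # draw 90% of the companies (rounded down) for the training set
    trn <- sample(comps, floor(length(comps) * 0.9))
    
    df.trn <- subset(df, companycode %in% trn)
    df.tst <- subset(df, !(companycode %in% trn))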
    

    This splits your data so that 90% of companies are in the training set and the rest in the test set.

    This does not guarantee that 90% of your rows will be training and 10% test. The rigorous way to achieve this is left as an exercise for the reader. The non-rigorous way would be to repeat the sampling until you get proportions that are roughly correct.
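
The non-rigorous approach of repeating the sampling could look roughly like the sketch below. It is one possible reading of that suggestion, not a definitive implementation: the function name `split_by_company`, its parameters `target`, `tol` and `max_tries`, and the toy `df` (placeholder expenses) are all assumptions made for illustration.

```r
# Toy data: company codes and years follow the example in the first answer;
# the expenses column is a placeholder.
df <- data.frame(
  companycode = rep(paste0("C", 1:8), times = c(3, 1, 7, 4, 1, 2, 1, 1)),
  year        = c(1:3, 1, 1:7, 1:4, 1, 1:2, 1, 1),
  expenses    = seq_len(20)
)

# Hypothetical helper: resample whole companies until the share of rows in the
# training set is within `tol` of `target`, giving up after `max_tries` draws.
split_by_company <- function(df, target = 0.9, tol = 0.05, max_tries = 1000) {
  comps <- unique(as.character(df$companycode))
  for (i in seq_len(max_tries)) {
    trn    <- sample(comps, floor(length(comps) * target))
    in_trn <- df$companycode %in% trn
    # mean() of a logical vector is the fraction of rows in the training set
    if (abs(mean(in_trn) - target) <= tol) {
      return(list(train = df[in_trn, ], test = df[!in_trn, ]))
    }
  }
  stop("no split within tolerance after ", max_tries, " tries")
}

set.seed(1)
parts <- split_by_company(df)
```

Each company ends up entirely in `parts$train` or entirely in `parts$test`. With few companies or very uneven group sizes, a tight `tol` may never be met, which is why the loop is capped.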
