Gidday,
I'm looking for a way to randomly split a data frame (e.g. a 90/10 split) into a testing and a training set for a model while keeping a certain grouping criterion intact.
Imagine I have a data frame like this:
> test[1:20,]
companycode year expenses
1 C1 1 8.47720
2 C1 2 8.45250
3 C1 3 8.46280
4 C2 1 14828.90603
5 C3 1 665.21565
6 C3 2 290.66596
7 C3 3 865.56265
8 C3 4 6785.03586
9 C3 5 312.02617
10 C3 6 760.48740
11 C3 7 1155.76758
12 C4 1 4565.78313
13 C4 2 3340.36540
14 C4 3 2656.73030
15 C4 4 1079.46098
16 C5 1 60.57039
17 C6 1 6282.48118
18 C6 2 7419.32720
19 C7 1 644.90571
20 C8 1 58332.34945
What I'm trying to do is to split this data frame into a training and a testing set using a defined splitting criterion. For the data above, I want the split to keep the companies separate: no company should appear in both data frames, so data set 1 contains different companies than data set 2.
For a 90/10 split, an ideal result would look like this:
> data_90split
companycode year expenses
4 C2 1 14828.90603
12 C4 1 4565.78313
13 C4 2 3340.36540
14 C4 3 2656.73030
15 C4 4 1079.46098
16 C5 1 60.57039
5 C3 1 665.21565
6 C3 2 290.66596
7 C3 3 865.56265
8 C3 4 6785.03586
9 C3 5 312.02617
10 C3 6 760.48740
11 C3 7 1155.76758
17 C6 1 6282.48118
18 C6 2 7419.32720
1 C1 1 8.47720
2 C1 2 8.45250
3 C1 3 8.46280
> data_10split
companycode year expenses
20 C8 1 58332.34945
19 C7 1 644.90571
I hope I have made clear what I'm looking for. Thanks for your feedback.
# df is your data frame (`test` in the question); companycode should be a factor,
# otherwise use unique(df$companycode) instead of levels()
comps <- levels(df$companycode)
trn <- sample(comps, length(comps)*0.9)        # draw 90% of the companies at random
df.trn <- subset(df, companycode %in% trn)     # all rows of the sampled companies
df.tst <- subset(df, !(companycode %in% trn))  # all rows of the remaining companies
This splits your data so that 90% of companies are in the training set and the rest in the test set.
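To check that the companies really are kept separate (and to see what fraction of rows each side got), a quick sanity check could look like this, assuming the objects df, df.trn, df.tst and comps from the snippet above:
intersect(unique(df.trn$companycode), unique(df.tst$companycode))  # expect character(0)
length(unique(df.trn$companycode)) / length(comps)                 # share of companies in training
nrow(df.trn) / nrow(df)                                            # share of rows in training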
The company-level split does not guarantee that 90% of your rows end up in the training set and 10% in the test set. The rigorous way to achieve that is left as an exercise for the reader; the non-rigorous way is to repeat the sampling until the proportions come out roughly right, as sketched below.
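A minimal sketch of that repeated-sampling idea, assuming df and companycode as above; the 3% tolerance and the cap of 1000 attempts are arbitrary choices, not part of the original answer:
target <- 0.9
tol <- 0.03                                   # accept +/- 3 percentage points around the target
comps <- unique(df$companycode)
for (i in 1:1000) {                           # cap the number of attempts
  trn <- sample(comps, round(length(comps) * target))
  df.trn <- subset(df, companycode %in% trn)
  df.tst <- subset(df, !(companycode %in% trn))
  if (abs(nrow(df.trn) / nrow(df) - target) <= tol) break
}
nrow(df.trn) / nrow(df)                       # should now be close to 0.9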
Assuming no conditions on what groups you want, the following will randomly split your data frame into 90% and 10% partitions (stored in a list):
set.seed(1)
# shuffle a logical vector with ~10% FALSE and ~90% TRUE, then split the rows on it
split(test, sample(1:nrow(test) > round(nrow(test) * .1)))
Produces:
$`FALSE`
companycode year expenses
10 C3 6 760.4874
12 C4 1 4565.7831
$`TRUE`
companycode year expenses
1 C1 1 8.47720
2 C1 2 8.45250
3 C1 3 8.46280
4 C2 1 14828.90603
5 C3 1 665.21565
6 C3 2 290.66596
7 C3 3 865.56265
8 C3 4 6785.03586
9 C3 5 312.02617
11 C3 7 1155.76758
13 C4 2 3340.36540
14 C4 3 2656.73030
15 C4 4 1079.46098
16 C5 1 60.57039
17 C6 1 6282.48118
18 C6 2 7419.32720
19 C7 1 644.90571
20 C8 1 58332.34945
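If you want the two partitions as standalone data frames rather than a list, one way (reusing the names from the question, purely for illustration) is to index the list returned by split() by its "FALSE"/"TRUE" names:
set.seed(1)
parts <- split(test, sample(1:nrow(test) > round(nrow(test) * .1)))
data_10split <- parts[["FALSE"]]  # the ~10% partition
data_90split <- parts[["TRUE"]]   # the ~90% partition
Note that this approach splits on rows only and does not keep companies together.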
Source: https://stackoverflow.com/questions/22518982/randomly-split-data-by-criterion-into-training-and-testing-data-set-using-r