Gidday,
I'm looking for a way to randomly split a data frame (e.g. a 90/10 split) into training and test sets for a model, while keeping a certain grouping criterion intact.
Assuming no conditions on what groups you want, the following will randomly split your data frame into 90% and 10% partitions (stored in a list):
set.seed(1)  # for reproducibility
split(df, sample(1:nrow(df) > round(nrow(df) * .1)))
Produces:
$`FALSE`
companycode year expenses
10 C3 6 760.4874
12 C4 1 4565.7831
$`TRUE`
companycode year expenses
1 C1 1 8.47720
2 C1 2 8.45250
3 C1 3 8.46280
4 C2 1 14828.90603
5 C3 1 665.21565
6 C3 2 290.66596
7 C3 3 865.56265
8 C3 4 6785.03586
9 C3 5 312.02617
11 C3 7 1155.76758
13 C4 2 3340.36540
14 C4 3 2656.73030
15 C4 4 1079.46098
16 C5 1 60.57039
17 C6 1 6282.48118
18 C6 2 7419.32720
19 C7 1 644.90571
20 C8 1 58332.34945
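The two partitions can then be pulled out of the split list by name. A minimal sketch, using a stand-in 20-row data frame in place of the one above:

```r
set.seed(1)                    # for reproducibility
df <- data.frame(x = 1:20)     # stand-in for the 20-row data frame above
parts <- split(df, sample(1:nrow(df) > round(nrow(df) * .1)))
df.tst <- parts$`FALSE`        # the ~10% partition (2 rows here)
df.trn <- parts$`TRUE`         # the ~90% partition (18 rows here)
```

Note that `split` names the list elements after the values of the grouping vector, hence the backticks around `FALSE` and `TRUE`.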
comps <- levels(df$companycode)  # assumes companycode is a factor; otherwise use unique(df$companycode)
trn <- sample(comps, round(length(comps) * 0.9))
df.trn <- subset(df, companycode %in% trn)
df.tst <- subset(df, !(companycode %in% trn))
This splits your data so that 90% of companies are in the training set and the rest in the test set.
This does not guarantee that 90% of your rows will be training and 10% test. The rigorous way to achieve this is left as an exercise for the reader. The non-rigorous way would be to repeat the sampling until you get proportions that are roughly correct.
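A sketch of that non-rigorous approach, using hypothetical toy data (the tolerance of 5 percentage points and the retry cap are arbitrary choices):

```r
set.seed(1)  # for reproducibility
# toy data: 8 companies with varying numbers of rows (20 rows total)
df <- data.frame(
  companycode = rep(paste0("C", 1:8), times = c(3, 1, 7, 4, 1, 2, 1, 1)),
  expenses    = runif(20, 1, 1000)
)
comps <- unique(df$companycode)
for (i in 1:1000) {                                  # retry cap to avoid looping forever
  trn <- sample(comps, round(length(comps) * 0.9))   # 90% of companies
  frac <- sum(df$companycode %in% trn) / nrow(df)    # resulting fraction of rows
  if (abs(frac - 0.9) <= 0.05) break                 # accept within 5 percentage points
}
df.trn <- subset(df, companycode %in% trn)
df.tst <- subset(df, !(companycode %in% trn))
```

Whole companies still land entirely in one partition or the other; the loop only rejects draws whose row split strays too far from 90/10.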