Gidday,
I'm looking for a way to randomly split a data frame (e.g. a 90/10 split) into a testing and a training set for a model while keeping a certain grouping criterion intact.
Imagine I have a data frame like this:
> test[1:20,]
companycode year expenses
1 C1 1 8.47720
2 C1 2 8.45250
3 C1 3 8.46280
4 C2 1 14828.90603
5 C3 1 665.21565
6 C3 2 290.66596
7 C3 3 865.56265
8 C3 4 6785.03586
9 C3 5 312.02617
10 C3 6 760.48740
11 C3 7 1155.76758
12 C4 1 4565.78313
13 C4 2 3340.36540
14 C4 3 2656.73030
15 C4 4 1079.46098
16 C5 1 60.57039
17 C6 1 6282.48118
18 C6 2 7419.32720
19 C7 1 644.90571
20 C8 1 58332.34945
What I'm trying to do is to split this data frame into a training and a testing set using a defined splitting criterion. For the data above, I want the split to keep the companies separate: no company should appear in both data frames, so data set 1 contains different companies than data set 2.
For a 90/10 split, an ideal result would look like this:
> data_90split
companycode year expenses
4 C2 1 14828.90603
12 C4 1 4565.78313
13 C4 2 3340.36540
14 C4 3 2656.73030
15 C4 4 1079.46098
16 C5 1 60.57039
5 C3 1 665.21565
6 C3 2 290.66596
7 C3 3 865.56265
8 C3 4 6785.03586
9 C3 5 312.02617
10 C3 6 760.48740
11 C3 7 1155.76758
17 C6 1 6282.48118
18 C6 2 7419.32720
1 C1 1 8.47720
2 C1 2 8.45250
3 C1 3 8.46280
> data_10split
companycode year expenses
20 C8 1 58332.34945
19 C7 1 644.90571
I hope I have made clear what I'm looking for. Thanks for your feedback.
# df is your data frame (`test` in the question); companycode should be a factor,
# otherwise use unique(df$companycode) instead of levels()
comps <- levels(df$companycode)
trn <- sample(comps, length(comps)*0.9)        # draw 90% of the companies at random
df.trn <- subset(df, companycode %in% trn)     # all rows of the sampled companies
df.tst <- subset(df, !(companycode %in% trn))  # all rows of the remaining companies
This splits your data so that 90% of companies are in the training set and the rest in the test set.
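To check that the companies really are kept separate (and to see what fraction of rows each side got), a quick sanity check could look like this, assuming the objects df, df.trn, df.tst and comps from the snippet above:
intersect(unique(df.trn$companycode), unique(df.tst$companycode))  # expect character(0)
length(unique(df.trn$companycode)) / length(comps)                 # share of companies in training
nrow(df.trn) / nrow(df)                                            # share of rows in training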
The company-level split does not guarantee that 90% of your rows end up in the training set and 10% in the test set. The rigorous way to achieve that is left as an exercise for the reader; the non-rigorous way is to repeat the sampling until the proportions come out roughly right, as sketched below.
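A minimal sketch of that repeated-sampling idea, assuming df and companycode as above; the 3% tolerance and the cap of 1000 attempts are arbitrary choices, not part of the original answer:
target <- 0.9
tol <- 0.03                                   # accept +/- 3 percentage points around the target
comps <- unique(df$companycode)
for (i in 1:1000) {                           # cap the number of attempts
  trn <- sample(comps, round(length(comps) * target))
  df.trn <- subset(df, companycode %in% trn)
  df.tst <- subset(df, !(companycode %in% trn))
  if (abs(nrow(df.trn) / nrow(df) - target) <= tol) break
}
nrow(df.trn) / nrow(df)                       # should now be close to 0.9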
Assuming no conditions on what groups you want, the following will randomly split your data frame into 90% and 10% partitions (stored in a list):
set.seed(1)
# shuffle a logical vector with ~10% FALSE and ~90% TRUE, then split the rows on it
split(test, sample(1:nrow(test) > round(nrow(test) * .1)))
Produces:
$`FALSE`
companycode year expenses
10 C3 6 760.4874
12 C4 1 4565.7831
$`TRUE`
companycode year expenses
1 C1 1 8.47720
2 C1 2 8.45250
3 C1 3 8.46280
4 C2 1 14828.90603
5 C3 1 665.21565
6 C3 2 290.66596
7 C3 3 865.56265
8 C3 4 6785.03586
9 C3 5 312.02617
11 C3 7 1155.76758
13 C4 2 3340.36540
14 C4 3 2656.73030
15 C4 4 1079.46098
16 C5 1 60.57039
17 C6 1 6282.48118
18 C6 2 7419.32720
19 C7 1 644.90571
20 C8 1 58332.34945
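If you want the two partitions as standalone data frames rather than a list, one way (reusing the names from the question, purely for illustration) is to index the list returned by split() by its "FALSE"/"TRUE" names:
set.seed(1)
parts <- split(test, sample(1:nrow(test) > round(nrow(test) * .1)))
data_10split <- parts[["FALSE"]]  # the ~10% partition
data_90split <- parts[["TRUE"]]   # the ~90% partition
Note that this approach splits on rows only and does not keep companies together.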
Source: https://stackoverflow.com/questions/22518982/randomly-split-data-by-criterion-into-training-and-testing-data-set-using-r