This question builds on one I asked earlier: "Creating data partitions over a selected range of data to be fed into caret::train function for cross-validation".
The data I am working with looks like this:
    df <- data.frame(Effect = rep(seq(from = 0.05, to = 1, by = 0.05), each = 5),
                     Time = rep(c(1:20, 1:20), each = 5),
                     Replicate = c(1:5))
Essentially, I would like to create custom partitions like those generated by the caret::groupKFold function, but with the folds restricted to a specified range (i.e. > 15 days). Each fold should withhold one point as the test set and use all other data for training. This would be repeated until every point in the specified range has been used as a test set. In the question linked above, @Missuse wrote some code towards this end that gets close to the desired output.
I would try to show the desired output, but in all honesty the caret::groupKFold function's output confuses me, so hopefully the above description will suffice. Happy to try and clarify though!
Here is one way you could create the desired partition using tidyverse:

    library(tidyverse)

    df %>%
      mutate(id = row_number()) %>%   # create a column called id which holds the row numbers
      filter(Time > 15) %>%           # subset the data frame according to your description
      split(.$id) %>%                 # split the data frame into a list with one element per id (row number)
      map(~ .x %>%
            select(id) %>%            # keep only the row number so it works with the indexOut argument in trainControl
            unlist %>%
            unname) -> folds_cv
EDIT: it seems the indexOut argument does not perform as expected, but the index argument does. So after making folds_cv, one can get the inverse (the training rows for each fold) using setdiff:

    folds_cv <- lapply(folds_cv, function(x) setdiff(seq_len(nrow(df)), x))
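As a cross-check, the same training indices can be built in base R without the pipeline. This is only a sketch: the names held_out and folds_check are made up for illustration, and it assumes the toy data frame from the question.

```r
# Same toy data as in the question
df <- data.frame(Effect = rep(seq(from = 0.05, to = 1, by = 0.05), each = 5),
                 Time = rep(c(1:20, 1:20), each = 5),
                 Replicate = c(1:5))

# Row numbers that should each be held out exactly once (hypothetical name)
held_out <- which(df$Time > 15)

# One training index per held-out row: every row except that one
folds_check <- lapply(held_out, function(i) setdiff(seq_len(nrow(df)), i))
names(folds_check) <- held_out

length(folds_check)           # number of folds
unique(lengths(folds_check))  # number of training rows in each fold
```

If the two counts come out as one fold per row with Time > 15 and nrow(df) - 1 training rows per fold, the list has the shape that the index argument of trainControl expects.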
and now:
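    library(caret)

    # index supplies the training rows for each resample;
    # the rows left out of each element are used for prediction
    test_control <- trainControl(index = folds_cv,
                                 savePredictions = "final")

    quad.lm2 <- train(Time ~ Effect,
                      data = df,
                      method = "lm",
                      trControl = test_control)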
with a warning:
    Warning message:
    In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
      There were missing values in resampled performance measures.

    > quad.lm2
    Linear Regression

    200 samples
      1 predictor

    No pre-processing
    Resampling: Bootstrapped (50 reps)
    Summary of sample sizes: 199, 199, 199, 199, 199, 199, ...
    Resampling results:

      RMSE          Rsquared  MAE
      3.552714e-16  NaN       3.552714e-16

    Tuning parameter 'intercept' was held constant at a value of TRUE
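So each resample trained on 199 rows and predicted on the 1 remaining row, repeating for each of the 50 rows we wanted to hold out one at a time. This can be verified by inspecting the saved resample predictions (with savePredictions = "final", caret stores them in the pred element of the fitted object).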
Why Rsquared is missing I am not sure; I will dig a bit deeper.
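One plausible explanation, offered as an assumption rather than something confirmed above: R-squared is commonly computed per resample as the squared correlation between observed and predicted values, and a correlation is undefined when the held-out set contains a single point. The base R sketch below uses made-up numbers (obs and pred are hypothetical) to show the effect. It also assumes the per-fold values are averaged with missing values dropped, which would turn 50 missing values into NaN.

```r
# A single held-out observation and its prediction (made-up numbers)
obs  <- 17
pred <- 17.2

sd(obs)                                # NA: spread is undefined for one value
r <- suppressWarnings(cor(pred, obs))  # missing for the same reason
r^2                                    # so a per-fold R-squared would be missing

# RMSE and MAE are still well defined for a single point
sqrt(mean((pred - obs)^2))
mean(abs(pred - obs))

# Averaging 50 all-missing fold values with na.rm = TRUE yields NaN
mean(rep(NA_real_, 50), na.rm = TRUE)
```

This would also be consistent with the warning about missing values in the resampled performance measures, while RMSE and MAE are reported normally.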