Question
I get the following error when using recipes::step_dummy with caret::train (my first attempt at combining the two packages):
Error: Not all variables in the recipe are present in the supplied training set
I'm not sure what is causing the error or how best to debug it. Any help getting the model to train would be much appreciated.
library(caret)
library(tidyverse)
library(recipes)
library(rsample)
data("credit_data")
## Split the data into training (75%) and test sets (25%)
set.seed(100)
train_test_split <- initial_split(credit_data)
credit_train <- training(train_test_split)
credit_test <- testing(train_test_split)
# Create recipe for data pre-processing
rec_obj <- recipe(Status ~ ., data = credit_train) %>%
  step_knnimpute(all_predictors()) %>%
  #step_other(Home, Marital, threshold = .2, other = "other") %>%
  #step_other(Job, threshold = .2, other = "others") %>%
  step_dummy(Records) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  prep(training = credit_train, retain = TRUE)
train_data <- juice(rec_obj)
test_data <- bake(rec_obj, credit_test)
set.seed(1055)
# the glm function models the second factor level.
lrfit <- train(rec_obj, data = train_data,
               method = "glm",
               trControl = trainControl(method = "repeatedcv",
                                        repeats = 5))
Answer 1:
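Don't prep the recipe before giving it to train, and pass the original training set as data. train preps the recipe itself during resampling. The juiced train_data no longer contains Records (step_dummy replaced it with dummy columns), so the check that every recipe variable is present in the data fails: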
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(tidyverse)
library(recipes)
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stringr':
#>
#> fixed
#> The following object is masked from 'package:stats':
#>
#> step
library(rsample)
data("credit_data")
## Split the data into training (75%) and test sets (25%)
set.seed(100)
train_test_split <- initial_split(credit_data)
credit_train <- training(train_test_split)
credit_test <- testing(train_test_split)
# Create recipe for data pre-processing
rec_obj <-
  recipe(Status ~ ., data = credit_train) %>%
  step_knnimpute(all_predictors()) %>%
  #step_other(Home, Marital, threshold = .2, other = "other") %>%
  #step_other(Job, threshold = .2, other = "others") %>%
  step_dummy(Records) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric())
set.seed(1055)
# the glm function models the second factor level.
lrfit <- train(rec_obj, data = credit_train,
               method = "glm",
               trControl = trainControl(method = "repeatedcv",
                                        repeats = 5))
lrfit
#> Generalized Linear Model
#>
#> 3341 samples
#> 13 predictor
#> 2 classes: 'bad', 'good'
#>
#> Recipe steps: knnimpute, dummy, center, scale
#> Resampling: Cross-Validated (10 fold, repeated 5 times)
#> Summary of sample sizes: 3006, 3008, 3007, 3007, 3007, 3007, ...
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.7965349 0.4546223
Created on 2019-03-20 by the reprex package (v0.2.1)
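On the debugging part of the question: one way to see why the original call failed is to compare the variables the recipe was defined on with the columns of the data handed to train. The sketch below is only illustrative. toy_train is a made-up data frame standing in for credit_train, and it assumes that summary() on a recipe accepts original = TRUE to return the variables as originally declared.

```r
library(recipes)

# Hypothetical stand-in for credit_train (values invented for illustration)
toy_train <- data.frame(
  Status  = factor(c("bad", "good", "good", "bad", "good", "good")),
  Records = factor(c("yes", "no", "no", "yes", "no", "yes")),
  Income  = c(80, 120, 95, 60, 150, 110)
)

# Prep the recipe up front, as in the question
rec_prepped <- recipe(Status ~ ., data = toy_train) %>%
  step_dummy(Records) %>%
  prep(training = toy_train)

# Processed data: Records has been replaced by a dummy column
processed <- bake(rec_prepped, toy_train)

# Variables the recipe was declared with vs. columns left after processing
original_vars <- summary(rec_prepped, original = TRUE)$variable
missing_vars <- setdiff(original_vars, names(processed))
print(missing_vars)
```

Any name listed in missing_vars is a variable the recipe expects but the processed data no longer has. In this toy case that should be Records, which is the kind of mismatch the error message is complaining about.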
Answer 2:
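Alternatively, it seems you can keep the prepped recipe and pass the formula to train directly (even though it is already specified in the recipe), fitting on the juiced data. Note that the preprocessing is then estimated once on the full training set rather than within each resample: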
glmfit <- train(Status ~ ., data = juice(rec_obj),
                method = "glm",
                trControl = trainControl(method = "repeatedcv", repeats = 5))
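With this formula-based approach the model only ever sees processed columns, so new data has to go through the same prepped recipe with bake before predicting. The following is a rough sketch on simulated data, not the credit data: toy, toy_train, and toy_test are invented for illustration, and plain 5-fold CV is used to keep it quick.

```r
library(caret)
library(recipes)

set.seed(100)
n <- 200

# Simulated stand-in for the credit data (hypothetical values)
toy <- data.frame(
  Records = factor(sample(c("no", "yes"), n, replace = TRUE)),
  Income  = rnorm(n, mean = 100, sd = 20)
)
# Outcome loosely tied to the predictors so the model has something to fit
score <- toy$Income + 15 * (toy$Records == "no") + rnorm(n, sd = 20)
toy$Status <- factor(ifelse(score > 105, "good", "bad"))

toy_train <- toy[1:150, ]
toy_test  <- toy[151:200, ]

# Estimate the preprocessing once, on the training rows only
rec_prepped <- recipe(Status ~ ., data = toy_train) %>%
  step_dummy(Records) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  prep(training = toy_train)

# Apply the same prepped recipe to both sets
train_baked <- bake(rec_prepped, toy_train)
test_baked  <- bake(rec_prepped, toy_test)

glmfit <- train(Status ~ ., data = train_baked,
                method = "glm",
                trControl = trainControl(method = "cv", number = 5))

# Predict on the baked test set, never on the raw one
test_pred <- predict(glmfit, newdata = test_baked)
test_accuracy <- mean(test_pred == test_baked$Status)
print(test_accuracy)
```

By contrast, a model trained by passing the unprepped recipe to train (as in Answer 1) can be given the raw test set in predict, since caret applies the recipe to new data itself.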
Source: https://stackoverflow.com/questions/55132850/recipesstep-dummy-carettrain-errornot-all-variables-in-the-recipe-are