Question
Sorry for all the purrr-related questions today, still trying to figure out how to make efficient use of it.
So with some help from SO I managed to get a random forest ranger model running based on input values coming from a data.frame. This is accomplished using purrr::pmap. However, I don't understand how the return values are generated from the called function. Consider this example:
library(ranger)
data(iris)
Input_list <- list(iris1 = iris, iris2 = iris) # let's assume these are different input tables
# the data.frame with the values for the function
hyper_grid <- expand.grid(
Input_table = names(Input_list),
mtry = c(1,2),
Classification = TRUE,
Target = "Species")
> hyper_grid
  Input_table mtry Classification  Target
1       iris1    1           TRUE Species
2       iris2    1           TRUE Species
3       iris1    2           TRUE Species
4       iris2    2           TRUE Species
# the function to be called for each row of the `hyper_grid` data frame
fit_and_extract_metrics <- function(Target, Input_table, Classification, mtry, ...) {
RF_train <- ranger(
dependent.variable.name = Target,
mtry = mtry,
data = Input_list[[Input_table]], # referring to the named object in the list
classification = Classification) # otherwise regression is performed
RF_train$confusion.matrix
}
# pmap calls the function once per row of hyper_grid, passing the columns as named arguments
purrr::pmap(hyper_grid, fit_and_extract_metrics)
It is supposed to return four 3x3 confusion matrices, as there are 3 levels in iris$Species; instead it returns giant confusion matrices. Can someone explain to me what is going on?
First lines of the output:
> purrr::pmap(hyper_grid, fit_and_extract_metrics)
[[1]]
predicted
true 4.4 4.7 4.8 4.9 5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6 6.1 6.2 6.3 6.4
4.3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4.4 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4.5 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4.6 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4.7 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4.8 0 0 1 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4.9 0 0 1 2 2 0 0 0 0 0 0 0 0 0 1 0 0 0 0
5 0 0 0 1 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5.1 0 0 0 0 0 8 0 0 0 1 0 0 0 0 0 0 0 0 0
Answer 1:
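The problem here is that the arguments passed to the function were factors, not character strings: expand.grid converts strings to factors by default (stringsAsFactors = TRUE). This tripped up the ranger function. The labels in the giant matrix (4.3, 4.4, ...) look like Sepal.Length values rather than species, which is consistent with the factor's integer code (1) being used to pick the first column of iris as the target. To solve this, all you need to do is set stringsAsFactors = FALSE in the expand.grid call: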
hyper_grid <- expand.grid(
Input_table = names(Input_list),
mtry = c(1,2),
Classification = TRUE,
Target = "Species", stringsAsFactors = FALSE)
You'll get:
[[1]]
predicted
true setosa versicolor virginica
setosa 50 0 0
versicolor 0 46 4
virginica 0 4 46
[[2]]
predicted
true setosa versicolor virginica
setosa 50 0 0
versicolor 0 46 4
virginica 0 5 45
[[3]]
predicted
true setosa versicolor virginica
setosa 50 0 0
versicolor 0 47 3
virginica 0 3 47
[[4]]
predicted
true setosa versicolor virginica
setosa 50 0 0
versicolor 0 47 3
virginica 0 3 47
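To see why a factor changes which column gets picked, here is a minimal base-R sketch that does not need ranger at all. It assumes (this is not confirmed from ranger's source here) that ranger looks up the target column by ordinary data.frame indexing; the variable names grid_default, grid_chr, col_from_factor, col_from_chr and col_coerced are made up for illustration.

```r
# Base R only; ranger is not needed to see the effect.
data(iris)

# expand.grid() converts strings to factors unless told otherwise.
grid_default <- expand.grid(Target = "Species")
grid_chr     <- expand.grid(Target = "Species", stringsAsFactors = FALSE)

class(grid_default$Target)  # "factor"
class(grid_chr$Target)      # "character"

# Indexing by a factor uses its integer codes, not its labels.
# The only level, "Species", has code 1, so column 1 is selected.
col_from_factor <- names(iris[, grid_default$Target, drop = FALSE])
col_from_chr    <- names(iris[, grid_chr$Target, drop = FALSE])

col_from_factor  # "Sepal.Length"
col_from_chr     # "Species"

# Defensive alternative (an assumption, not part of the answer above):
# coerce to character before indexing, whatever the grid holds.
col_coerced <- names(iris[, as.character(grid_default$Target), drop = FALSE])
col_coerced      # "Species"
```

If that assumption holds, calling as.character(Target) inside fit_and_extract_metrics would be another way to guard against factor columns. Setting stringsAsFactors = FALSE in expand.grid remains the simpler fix.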
Source: https://stackoverflow.com/questions/60956516/r-ranger-confusion-matrix-is-larger-than-supposed-when-using-expand-grid-and-pur