Question
This post is related to: How R automatically coerces character input to numeric?
I am a user of the randomForest package and have a quick question: can anyone explain, or point me to the place in the source code that shows, how the randomForest package in R treats character variables? I have passed character variables directly as input, and I have also converted them to factors before fitting, but the model performance differs between the two.
I would appreciate an answer or a pointer to the relevant part of the source code.
I am using R 4.0.1. As far as I know, with earlier versions of R the randomForest package would not accept character variables and would throw an error.
Thank you very much!!
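One version difference that may be relevant, assuming the data was built or read with default settings (the post does not say how master_train was created): since R 4.0.0, data.frame() and read.csv() default to stringsAsFactors = FALSE, so string columns stay character unless converted. A minimal sketch with a made-up column:

```r
# Hypothetical toy column; "city" is not from the original data.
df_default <- data.frame(city = c("paris", "berlin"))
df_factor  <- data.frame(city = c("paris", "berlin"), stringsAsFactors = TRUE)

# On R >= 4.0.0 the first is "character"; on older versions it was "factor".
print(class(df_default$city))
print(class(df_factor$city))
```

If the same script ran on an older R version, the string columns would most likely have reached randomForest as factors already.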
Edit: here is part of the code. The difference shows up in the MAE and MAPE computed at the end, and it depends on whether I apply as.factor to the character features (that conversion is not shown here).
library(randomForest)

# Fit a regression forest on all predictors in master_train
rf <- randomForest(
  volume ~ .,
  data = master_train,
  ntree = 500,
  mtry = 15,
  nodesize = 50,
  maxnodes = 100,
  # sampsize = 10000,
  # replace = TRUE,
  # nPerm = 5,
  importance = TRUE
  # proximity = TRUE,
  # keep.forest = FALSE
)
# Predict on the training and test sets
pred1 <- predict(rf, newdata = master_train)
pred2 <- predict(rf, newdata = master_test)
# Replace missing predictions with 0
pred1[is.na(pred1)] <- 0
pred2[is.na(pred2)] <- 0
# MAE on the training and test sets
mean(abs(pred1 - master_train$volume))
mean(abs(pred2 - master_test$volume))
# MAE divided by the mean of the target (the MAPE measure referred to above)
mean(abs(pred1 - master_train$volume)) / mean(master_train$volume)
mean(abs(pred2 - master_test$volume)) / mean(master_test$volume)
An update: I checked using
getTree(mod,1,labelVar=TRUE)
I can see that if those character variables are converted to factors, the "split point" in the output is an integer, which indicates a categorical variable (see: https://www.rdocumentation.org/packages/randomForest/versions/4.6-14/topics/getTree). If they are not converted to factors, the "split point" is not an integer.
So my guess is that R coerces the values of those character variables into numeric values. But how? That is the topic of the other thread I mentioned at the beginning.
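A minimal sketch of one way to probe this guess, assuming the predictors end up going through base R's data.matrix() before the trees are grown (that is an assumption here, not something confirmed from the package source in this post). data.matrix() is documented to convert character columns to factors first and then to their integer codes. The toy data and column names below are made up:

```r
# Hypothetical toy data; "city" and "volume" values are illustrative only.
df <- data.frame(
  city = c("paris", "berlin", "paris", "amsterdam"),
  volume = c(10, 20, 15, 5),
  stringsAsFactors = FALSE
)

# Character columns become factors, then integer codes
# (levels are sorted, so amsterdam = 1, berlin = 2, paris = 3).
m <- data.matrix(df)
print(m[, "city"])

# The same codes computed by hand for comparison.
codes <- as.integer(factor(df$city))
stopifnot(all(m[, "city"] == codes))
```

If the coercion works this way, a character column would be split like an ordinary numeric variable on its alphabetical level codes, which would be consistent with the non-integer "split point" values (thresholds between adjacent codes) seen in the getTree output.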
Source: https://stackoverflow.com/questions/63186926/how-randomforest-package-in-r-interprets-character-variables