How the randomForest package in R interprets character variables

不羁岁月 submitted on 2021-01-29 14:26:58

Question


This post is related to: How R automatically coerces character input to numeric?

I use the randomForest package and have a quick question: can anyone explain, or point me to the place in the source code that shows, how the randomForest package in R treats character variables? I have passed character variables directly as input, and I have also converted them to factors first, but the two models perform differently.

I am hoping for a quick answer, or a pointer to the relevant part of the source code.

I am using R 4.0.1. As far as I know, with earlier versions of R the randomForest package would not accept character variables and would throw an error.

Thank you very much!!

Edit: here is part of the code. The difference shows up in the MAE and MAPE measurements at the end, and it depends on whether I apply as.factor to the character features (that step is not shown here).

library(randomForest)

rf <- randomForest(
  volume ~ .,
  data = master_train,
  ntree = 500,
  mtry = 15,
  nodesize = 50,
  maxnodes = 100,
  # sampsize = 10000,
  # replace = TRUE,
  # nPerm = 5,
  importance = TRUE
  # proximity = TRUE,
  # keep.forest = FALSE
)

pred1 = predict(rf, newdata = master_train)
pred2 = predict(rf, newdata = master_test)

pred1[is.na(pred1)] <- 0
pred2[is.na(pred2)] <- 0

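# MAE on the training and test sets
mean(abs(pred1 - master_train$volume))
mean(abs(pred2 - master_test$volume))

# MAE scaled by the mean of volume (the measure I refer to as MAPE)
mean(abs(pred1 - master_train$volume)) / mean(master_train$volume)
mean(abs(pred2 - master_test$volume)) / mean(master_test$volume)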
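The as.factor step is not shown in the question, so the following is only a minimal sketch of what such a conversion might look like. The data frames `train` and `test` and the column `region` are hypothetical stand-ins for `master_train` and `master_test`. The sketch assumes one wants both sets to share the same factor levels, which is a design choice made here, not something stated in the post.

```r
# Hypothetical stand-ins for master_train / master_test (not the real data).
train <- data.frame(
  region = c("north", "south", "east"),
  volume = c(10, 20, 30),
  stringsAsFactors = FALSE
)
test <- data.frame(
  region = c("south", "west"),
  volume = c(15, 25),
  stringsAsFactors = FALSE
)

# Find the character columns in the training data.
char_cols <- names(train)[vapply(train, is.character, logical(1))]

# Convert each character column to a factor, using the union of the values
# seen in both sets so that train and test end up with identical levels.
for (col in char_cols) {
  lv <- sort(unique(c(train[[col]], test[[col]])))
  train[[col]] <- factor(train[[col]], levels = lv)
  test[[col]]  <- factor(test[[col]],  levels = lv)
}

str(train)
str(test)
```

Fitting once on the converted frames and once on the untouched character columns, with the same seed, would make the comparison described in the question reproducible.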

An update: I checked using

getTree(rf, 1, labelVar = TRUE)

I can see that if the character variables are converted to factors, the "split point" in the output is an integer, which means the variable is treated as categorical (see: https://www.rdocumentation.org/packages/randomForest/versions/4.6-14/topics/getTree). If they are not converted to factors, the "split point" in the output is not an integer.

So my guess is that R coerces the values of those character variables to numeric values. But how? That is the topic of the other thread I mentioned at the beginning.
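One way such a coercion can happen in base R is through `data.matrix()`, which is documented to convert character columns to factors and then to their integer codes. Whether randomForest takes exactly this path for character predictors is an assumption here, not something confirmed in this post. The sketch below uses a hypothetical toy data frame and only shows what that base R coercion produces.

```r
# Hypothetical toy data with one character column.
df <- data.frame(
  colour = c("red", "green", "blue", "green"),
  size   = c(1.5, 2.0, 3.5, 4.0),
  stringsAsFactors = FALSE
)

# data.matrix() turns the character column into a factor and then into
# integer codes, so the whole data frame becomes a numeric matrix.
m <- data.matrix(df)
print(m)

# The codes follow the sorted unique strings: blue = 1, green = 2, red = 3.
codes <- as.integer(factor(df$colour))
print(codes)
```

If the predictors are coerced this way, the column is just an ordinary number to the tree, so a split can fall between two codes (for example at 1.5). That would be consistent with the non-integer "split point" values seen in the getTree output for unconverted character variables.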

Source: https://stackoverflow.com/questions/63186926/how-randomforest-package-in-r-interprets-character-variables
