understanding max_feature in random forest


Question


I have a question about training the forest. I used 5-fold cross-validation and RMSE as the guideline to find the best parameters for the model, and I eventually found that max_feature=1 gives the smallest RMSE. That is strange to me, since max_feature is the number of features considered at each split. Generally, if I want to find the "best" split that lowers the impurity, the tree should, at best, consider all the features and pick the one that results in the lowest impurity after splitting. However, in cross-validation, max_feature=1 gives the lowest RMSE. Is that because the more features are considered, the less well the tree generalizes? Thanks
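For reference, a minimal sketch of the kind of search described above, assuming scikit-learn's RandomForestRegressor with a synthetic placeholder dataset (the actual data and parameter grid are not from the question):

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    # placeholder data standing in for the questioner's dataset
    X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

    search = GridSearchCV(
        RandomForestRegressor(n_estimators=200, random_state=0),
        param_grid={"max_features": [1, 2, 4, 6, 8, 10]},
        scoring="neg_root_mean_squared_error",  # sklearn maximizes scores, so RMSE is negated
        cv=5,
    )
    search.fit(X, y)
    print(search.best_params_)   # which max_features gave the lowest cross-validated RMSE
    print(-search.best_score_)   # that RMSE itself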


Answer 1:


So, the idea of a random forest is that a single decision tree has a large variance but a low bias (it is overfitting). We then grow many different trees and average them to reduce that variance.

Let X_i be the trees in the forest. Assume each tree is identically distributed with mean mu and variance sigma^2, and let the prediction be the mean of all X_i. The X_i are not independent (since they share some of the same training data and the same features) but positively correlated with some constant p. We can write the variance of the mean (the prediction) as:

Var( (1/n) * sum_i X_i ) = p * sigma^2 + (1 - p) * sigma^2 / n

where n is the number of trees.
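As a quick numeric illustration of that formula (a sketch, not part of the original answer): for large n the (1 - p) * sigma^2 / n term vanishes and the variance plateaus at p * sigma^2, so lowering the correlation p is what keeps averaging effective. The helper name var_of_mean and the chosen values are illustrative only:

    sigma2 = 1.0  # assumed per-tree variance

    def var_of_mean(p, n):
        # variance of the averaged prediction, from the formula above
        return p * sigma2 + (1.0 - p) * sigma2 / n

    for p in (0.0, 0.3, 0.7):
        print(p, [round(var_of_mean(p, n), 3) for n in (1, 10, 100, 1000)])
    # p = 0.0: the variance keeps shrinking as n grows
    # p = 0.7: it flattens out near 0.7 no matter how many trees are added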

Since everything but p is fixed, you want to reduce p, i.e. the correlation between trees, as much as possible. If every tree can use all the features at each split, it is very likely that you end up with highly correlated ("identical") trees and thus a high variance (even though you cross-validate).

With that in mind, it is not strange that max_feature=1 is the optimal choice, since the trees grown this way are very unlikely to be identical (or even alike).
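One rough way to see this empirically (again a sketch on made-up data, not part of the original answer) is to compare the average pairwise correlation of the individual trees' predictions when max_features=1 versus when every feature is available at each split:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
    X_train, X_test, y_train, _ = train_test_split(X, y, random_state=0)

    for mf in (1, 10):
        rf = RandomForestRegressor(n_estimators=100, max_features=mf, random_state=0)
        rf.fit(X_train, y_train)
        # rows = individual trees, columns = test points
        per_tree = np.array([tree.predict(X_test) for tree in rf.estimators_])
        corr = np.corrcoef(per_tree)
        n_trees = len(corr)
        mean_corr = (corr.sum() - n_trees) / (n_trees * (n_trees - 1))  # mean off-diagonal
        print(f"max_features={mf}: mean pairwise tree correlation ~ {mean_corr:.2f}")

With max_features=1 the per-tree predictions should be noticeably less correlated than with all features, which is exactly the p the formula says you want to push down.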

It is just the classic "bias-variance trade-off".

EDIT: The proof for the formula

Var( (1/n) * sum_i X_i )
  = (1/n^2) * [ sum_i Var(X_i) + sum_{i != j} Cov(X_i, X_j) ]
  = (1/n^2) * [ n * sigma^2 + n * (n - 1) * p * sigma^2 ]      (since Cov(X_i, X_j) = p * sigma^2)
  = sigma^2 / n + ((n - 1)/n) * p * sigma^2
  = p * sigma^2 + (1 - p) * sigma^2 / n

Source: https://stackoverflow.com/questions/63406684/understanding-max-feature-in-random-forest
