Question
I got a question when training the forest. I used 5-fold cross-validation with RMSE as the criterion to find the best parameters for the model, and I eventually found that max_feature=1 gives the smallest RMSE. That is strange to me, since max_feature is the number of features considered at each split. Generally, if I want to find the "best" split that lowers the impurity the most, the tree should, at best, consider all the features and pick the one that results in the lowest impurity after splitting. However, in the cross-validation, max_feature=1 gives the lowest RMSE. Is that because the more features are considered, the worse the tree generalizes? Thanks
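For reference, a minimal sketch of the tuning setup described above, assuming scikit-learn's RandomForestRegressor and a synthetic dataset (the asker's actual data and parameter grid are not shown):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the asker's data.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# 5-fold cross-validation over max_features, scored by (negated) RMSE.
grid = GridSearchCV(
    RandomForestRegressor(n_estimators=200, random_state=0),
    param_grid={"max_features": [1, 3, 5, 7, 10]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)   # the asker observed max_features=1 winning
print(-grid.best_score_)   # best cross-validated RMSE
```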
Answer 1:
So, the idea of a random forest is that a single decision tree has high variance but low bias (it overfits). We then grow many different trees to reduce that variance.
Let $X_i$ be the trees in the forest. Assume the trees are identically distributed with mean $\mu$ and variance $\sigma^2$, and let the prediction be the mean of all $X_i$. We assume the $X_i$ are not independent (since they share some of the same training data and the same features) but are positively correlated with some constant $p$. The variance of the mean (the prediction) can then be written as

$$\mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = p\,\sigma^{2} + \frac{1-p}{n}\,\sigma^{2},$$

where $n$ is the number of trees.
Since everything but $p$ is fixed, you want to make $p$, the correlation between the trees, as small as possible. If every split considers all of the same features, it is very likely that you end up with correlated ("identical") trees and therefore high variance, even though you cross-validate.
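To make that concrete, here is a small numeric check of the formula above with illustrative values $\sigma^2 = 1$ and $n = 500$ trees; lowering the correlation $p$ shrinks the variance far more than the $1/n$ term can:

```python
# Evaluate Var(mean) = p*sigma^2 + (1 - p)*sigma^2/n for a few correlations p.
sigma2, n = 1.0, 500
for p in (0.9, 0.5, 0.1, 0.01):
    var = p * sigma2 + (1 - p) * sigma2 / n
    print(f"p = {p}  Var(mean) = {var:.4f}")
# -> 0.9002, 0.5010, 0.1018, 0.0120
```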
With that in mind, it is not strange that max_feature=1 is the optimal choice here, since the trees grown are then very unlikely to be identical (or even alike).
It is just the classic "bias-variance trade-off".
EDIT: The proof for the formula:
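A standard derivation, using the assumptions already stated (identically distributed trees with variance $\sigma^2$ and pairwise correlation $p$):

$$\mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n^{2}}\left(\sum_{i=1}^{n}\mathrm{Var}(X_i) + \sum_{i \neq j}\mathrm{Cov}(X_i, X_j)\right)
  = \frac{1}{n^{2}}\left(n\sigma^{2} + n(n-1)\,p\,\sigma^{2}\right)
  = p\,\sigma^{2} + \frac{1-p}{n}\,\sigma^{2}.$$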
Source: https://stackoverflow.com/questions/63406684/understanding-max-feature-in-random-forest