I used RandomForest for a regression problem. I used importance(rf,type=1) to get the %IncMSE for the variables and one of them has a negative %IncMSE. Does this m
Question 1 - why does ntree show 1?
summary(rf) shows you the length of each of the objects included in your rf variable, not their values. The 1 next to ntree therefore means that rf$ntree is of length 1. If you type rf$ntree in your console, you will see that it shows 800.
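To see the difference between the length of a component and its value without fitting a model, here is a minimal sketch. The list rf_like is a hypothetical stand-in for a fitted forest, not a real randomForest object, and its two components are only there for illustration.

```r
# Hypothetical stand-in for a fitted forest: a plain list with two components.
rf_like <- list(
  ntree = 800,          # a single number, so its length is 1
  mse   = numeric(800)  # one entry per tree, so its length is 800
)

# Prints a table with Length, Class and Mode for each component.
print(summary(rf_like))

# The length of the component versus the value it holds.
n_len <- length(rf_like$ntree)
n_val <- rf_like$ntree
```

The summary table reports how many elements each component has, while rf_like$ntree returns the stored number itself.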
Question 2 - does a negative %IncMSE show a "bad" variable?
%IncMSE:
The way this is calculated is by first computing the MSE of the whole model. Let's call this MSEmod. After this, for each one of the variables (columns in your data set) in turn, the values are randomly shuffled (permuted) so that a "bad" variable is created, and a new MSE is calculated. For example, imagine that one column holds the values 1,2,3,4,5 in its rows. After the permutation these might end up as 4,3,1,2,5. With col1 permuted (all of the other columns remain exactly the same, since we want to examine col1's importance), the new MSE of the model is calculated; let's call it MSEcol1 (in a similar manner you will have MSEcol2, MSEcol3, but let's keep it simple and only deal with MSEcol1 here).
Since the second MSE was computed using a completely random variable, we would expect MSEcol1 to be higher than MSEmod (the higher the MSE, the worse). Therefore, when we take the difference of the two, MSEcol1 - MSEmod, we usually expect a positive number. In your case, a negative number shows that the random variable worked better, which suggests that the variable is probably not predictive enough, i.e. not important.
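Keep in mind that the description I gave you is the high-level one. In reality, according to the package documentation, the MSE is computed per tree on the out-of-bag data, before and after permuting the variable, and the differences are averaged over all trees and, by default, normalized by the standard deviation of the differences. But the high-level story is this.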
In algorithm form:
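The steps above can be sketched in base R roughly as follows. This is a simplified illustration, not what randomForest does internally: lm() stands in for the forest, a held-out half of the data stands in for the out-of-bag samples, and the toy data, the column names col1 and col2, and the helpers mse() and perm_importance() are all made up for the example.

```r
# Simplified permutation-importance sketch using base R only.
# Assumption: lm() is a stand-in model, not a random forest.
set.seed(42)

n <- 200
dat <- data.frame(
  col1 = rnorm(n),  # informative predictor
  col2 = rnorm(n)   # pure noise, unrelated to y
)
dat$y <- 3 * dat$col1 + rnorm(n, sd = 0.5)

train <- dat[1:100, ]
test  <- dat[101:200, ]  # held-out rows play the role of out-of-bag data

fit <- lm(y ~ col1 + col2, data = train)

# Mean squared error of a model on a data frame that contains y.
mse <- function(model, newdata) {
  mean((newdata$y - predict(model, newdata = newdata))^2)
}

# MSEmod: error of the model on the untouched data.
mse_mod <- mse(fit, test)

# Average of (MSE after shuffling one column) - MSEmod over several shuffles.
perm_importance <- function(model, newdata, column, n_rep = 50) {
  base <- mse(model, newdata)
  diffs <- replicate(n_rep, {
    shuffled <- newdata
    shuffled[[column]] <- sample(shuffled[[column]])  # permute this column only
    mse(model, shuffled) - base
  })
  mean(diffs)
}

inc_col1 <- perm_importance(fit, test, "col1")
inc_col2 <- perm_importance(fit, test, "col2")
```

Here inc_col1 should come out clearly positive, while inc_col2 should be close to zero and can land slightly below it, which is the same situation as a negative %IncMSE.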
Hope it is clear now!