What does negative %IncMSE in RandomForest package mean?

你的背包 2021-02-04 11:14

I used RandomForest for a regression problem. I used importance(rf,type=1) to get the %IncMSE for the variables and one of them has a negative %IncMSE. Does this m

1 answer
  •  说谎 (original poster)
    2021-02-04 11:32

    Question 1 - why does ntree show 1?

    summary(rf) shows the length of each object stored in your rf variable. The 1 you see therefore means that rf$ntree has length 1, i.e. it is a single number. If you type rf$ntree at the console you will see that its value is 800.
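To see why summary() reports 1 rather than 800, here is a minimal base R sketch. The list `fake_rf` is a hypothetical stand-in invented for illustration, not a real randomForest object; it only mimics the idea of a component holding a single number next to a component holding one value per tree.

```r
# Hypothetical stand-in for a fitted model: a plain list, not a real randomForest object.
fake_rf <- list(
  ntree = 800,           # a single number, so its length is 1
  mse   = numeric(800)   # one entry per tree, so its length is 800
)

# For a list, summary() tabulates the Length, Class and Mode of each component.
print(summary(fake_rf))

# The stored value itself is read with `$`.
print(fake_rf$ntree)          # the value
print(length(fake_rf$ntree))  # the length that summary() reports

# Note: `$` only partially matches names from their start, so `$tree` does not find `ntree`.
print(is.null(fake_rf$tree))
```

The point is only that the Length column describes how many elements a component has, not what those elements are.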

    Question 2 - does a negative %IncMSE show a "bad" variable?

    %IncMSE:
    This is calculated by first computing the MSE of the whole model. Let's call this MSEmod. Then, for each variable (column in your data set), the values are randomly shuffled (permuted) so that a "bad" variable is created, and a new MSE is calculated. For example, imagine that one column holds the values 1,2,3,4,5 in rows 1 to 5. After the permutation these might end up as 4,3,1,2,5. All of the other columns remain exactly the same, since we want to examine col1's importance alone.

    The MSE calculated after the permutation we will call MSEcol1 (in the same way you will have MSEcol2, MSEcol3, and so on, but let's keep it simple and only deal with MSEcol1 here). Since the second MSE was computed with a completely random variable, we would expect MSEcol1 to be higher than MSEmod (the higher the MSE, the worse the model). Therefore, when we take the difference MSEcol1 - MSEmod, we usually expect a positive number.

    In your case the negative number means the model did slightly better with the shuffled column than with the real one, which suggests that the variable is probably not predictive enough, i.e. not important.

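    Keep in mind that this description is the high-level view. In reality, randomForest computes the MSE on the out-of-bag data for each tree, before and after permuting the variable, averages the differences over all trees, and by default normalizes the result by the standard deviation of the differences. But the high-level story is the same.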
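The scaling can be checked directly with the `scale` argument of importance(). The sketch below assumes the randomForest package is installed and skips itself otherwise. The data set, the column names `signal` and `noise`, the seed, and `ntree = 200` are all made up for illustration. The importance of the noise column may come out close to zero or negative, but that is not guaranteed on any given run.

```r
# Assumes the randomForest package is installed; nothing runs otherwise.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)  # arbitrary seed, only for reproducibility

  # Simulated data (an assumption): `signal` drives y, `noise` is unrelated to y.
  n   <- 300
  dat <- data.frame(signal = rnorm(n), noise = rnorm(n))
  dat$y <- 2 * dat$signal + rnorm(n)

  # importance = TRUE is needed for the permutation-based measure to be stored.
  rf <- randomForest::randomForest(y ~ ., data = dat, ntree = 200,
                                   importance = TRUE)

  # Unscaled: the averaged increase in MSE after permuting each variable.
  raw <- randomForest::importance(rf, type = 1, scale = FALSE)

  # Scaled (the default): the same values divided by rf$importanceSD.
  scaled <- randomForest::importance(rf, type = 1, scale = TRUE)

  print(cbind(raw = raw[, 1], sd = rf$importanceSD, scaled = scaled[, 1]))
}
```

Comparing the two columns shows that the sign of the importance is unaffected by the scaling, so a negative value stays negative either way.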

    In algorithm form:

    1. Compute model MSE
    2. For each variable in the model:
      • Permute variable
      • Calculate new model MSE according to variable permutation
      • Take the difference: new model MSE minus model MSE
    3. Collect the results in a list
    4. Rank variables' importance according to the value of the %IncMSE. The greater the value the better
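The steps above can be sketched in base R. This is only an illustration of the idea, not how randomForest implements it: it uses a linear model (`lm`) as a stand-in for the forest and a held-out test set instead of out-of-bag data. The simulated data, the helper names `mse` and `perm_importance`, and the choice to average over `n_rep = 20` shuffles are all assumptions made for the example.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Simulated data (an assumption): x1 drives y, x2 is pure noise.
n   <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 3 * dat$x1 + rnorm(n, sd = 0.5)

train <- dat[1:300, ]
test  <- dat[301:400, ]

# Stand-in model: linear regression instead of a random forest.
fit <- lm(y ~ x1 + x2, data = train)

# Mean squared error of a model's predictions on a data frame with column y.
mse <- function(model, newdata) {
  mean((newdata$y - predict(model, newdata))^2)
}

# Step 1: baseline MSE; steps 2-3: shuffle one column at a time and
# collect the increase in MSE (averaged over n_rep shuffles).
perm_importance <- function(model, newdata, vars, n_rep = 20) {
  mse_mod <- mse(model, newdata)
  sapply(vars, function(v) {
    diffs <- replicate(n_rep, {
      shuffled <- newdata
      shuffled[[v]] <- sample(shuffled[[v]])  # permute only this column
      mse(model, shuffled) - mse_mod          # new MSE minus model MSE
    })
    mean(diffs)
  })
}

inc_mse <- perm_importance(fit, test, c("x1", "x2"))

# Step 4: rank the variables, largest increase in MSE first.
print(sort(inc_mse, decreasing = TRUE))
```

Shuffling x1 should raise the MSE substantially, while shuffling x2 changes it very little, and that small change can land on either side of zero. That is the situation behind a negative importance value.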

    Hope it is clear now!
