R random forest error - type of predictors in new data do not match

挽巷 2020-12-04 14:37

I am trying to use the quantile regression forest function in R (quantregForest), which is built on the randomForest package. I am getting a type mismatch error that I can't quite

8 Answers
  • 2020-12-04 15:12
    levels(PredictData$columnName) <- rfmodels$forest$xlevels$columnName
    

    However, assigning the levels this way relabels the existing values in PredictData, so the original values have to be restored afterwards:

    # Keep a copy so the original values can be restored
    x <- PredictData
    levels(PredictData$columnName) <- rfmodels$forest$xlevels$columnName
    
    # Copy the original values back; the factor keeps the model's levels
    for (i in seq_along(x$columnName))
    {
      PredictData$columnName[i] <- x$columnName[i]
    }
    

    This resolves the error.
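To see why the direct assignment alters the data, here is a minimal base-R sketch with made-up values. The names train_levels and new_col are hypothetical stand-ins for the model's stored levels and a prediction column. It also shows a vectorized alternative to the loop, which is one possible approach rather than the answerer's own method.

```r
# Hypothetical stand-ins: levels stored by the model, and a new-data column
train_levels <- c("a", "b", "c")
new_col <- factor(c("c", "b", "c"))   # its own levels are "b", "c"

# Direct assignment relabels by position: "b" becomes "a", "c" becomes "b"
relabelled <- new_col
levels(relabelled) <- train_levels

# Rebuilding from the labels keeps the values and adopts the model's levels
rebuilt <- factor(as.character(new_col), levels = train_levels)

print(as.character(relabelled))
print(as.character(rebuilt))
print(levels(rebuilt))
```

Any value in the new data that is not among the training levels becomes NA in the rebuilt factor, so it is worth checking for new NAs afterwards.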

  • 2020-12-04 15:17

    @mgoldwasser is right in general, but there is also a very nasty bug in predict.randomForest: even if the training and prediction sets have exactly the same levels, you can still get this error. It happens when a factor has NA embedded as a separate level. The problem is that predict.randomForest essentially does the following:

    # Assume your original factor has two "proper" levels + NA level:
    f <- factor(c(0,1,NA), exclude=NULL)
    
    length(levels(f)) # => 3
    levels(f)         # => "0" "1" NA
    
    # Note that
    sum(is.na(f))     # => 0
    # i.e., the values of the factor are not `NA` only the corresponding level is.
    
    # Internally predict.randomForest passes the factor (the one of the training set)
    # through the function `factor(.)`.
    # Unfortunately, it does _not_ do this for the prediction set.
    # See what happens to f if we do that:
    pf <- factor(f)
    
    length(levels(pf)) # => 2
    levels(pf)         # => "0" "1"
    
    # In other words:
    length(levels(f)) != length(levels(factor(f))) 
    # => sad but TRUE
    

    So, it will always discard the NA level from the training set and will always see one additional level in the prediction set.

    A workaround is to replace the NA level with the string "NA" before using randomForest:

    levels(f)[is.na(levels(f))] <- "NA"
    levels(f) # => "0"  "1"  "NA"
              #              .... note that this is no longer a plain `NA`
    

    Now calling factor(f) won't discard the level, and the check succeeds.
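The same workaround can be applied to every factor column of a data frame at once. The following is only a sketch in base R: the helper name na_level_to_string and the sample data frame d are made up for illustration, and it assumes no column already uses the literal string "NA" as a level.

```r
# Hypothetical helper: turn an NA level into the string "NA" in all factor columns
na_level_to_string <- function(df) {
  for (col in names(df)) {
    if (is.factor(df[[col]]) && anyNA(levels(df[[col]]))) {
      levels(df[[col]])[is.na(levels(df[[col]]))] <- "NA"
    }
  }
  df
}

# Made-up data with NA embedded as a level
d <- data.frame(f = factor(c(0, 1, NA), exclude = NULL), y = 1:3)
d2 <- na_level_to_string(d)

print(levels(d2$f))
print(length(levels(factor(d2$f))))   # the level survives factor(.)
```

Running the helper on both the training and the prediction data before fitting keeps the two sets consistent.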

  • 2020-12-04 15:24

    I solved it this way, and it works.

    Get the factor levels directly from the rf model itself:

    levels(PredictData$columnName) <- rfmodels$forest$xlevels$columnName
    
  • 2020-12-04 15:28

    I had the same problem. You can use a small trick to equalize the classes of the training and test sets: bind the first row of the training set to the test set, and then delete it. For your example it should look like this:

        xtest <- rbind(xtrain[1, ] , xtest)
        xtest <- xtest[-1,]
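A small base-R sketch of what the trick does, using made-up data frames (the column names g and v are hypothetical). When rbind combines data frames, factor levels are merged with the first data frame's levels leading, so the test column picks up the training levels.

```r
# Made-up training and test sets; the test factor is missing level "c"
xtrain <- data.frame(g = factor(c("a", "b", "c")), v = 1:3)
xtest  <- data.frame(g = factor(c("b", "a")), v = 4:5)

levels_before <- levels(xtest$g)

# Bind the first training row on top, then drop it again
xtest <- rbind(xtrain[1, ], xtest)
xtest <- xtest[-1, ]

print(levels_before)
print(levels(xtest$g))
```

Note that this only adds levels. If the test set contains a value never seen in training, that extra level stays appended after the training levels and the mismatch remains.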
    
  • 2020-12-04 15:28

    I just solved it by doing the following:

    ## Creating sample data
    values_development=factor(c("a", "b", "c")) ## Values used when building the random forest model
    values_production=factor(c("a", "b", "c", "ooops")) ## New values seen when using the model
    
    ## Deleting cases which were not present when developing
    values_production=sapply(as.character(values_production), function(x) if(x %in% values_development) x else NA)
    
    ## Creating the factor variable, (with the correct NA value level)
    values_production=factor(values_production)
    
    ## Checking
    values_production # =>  a     b     c  <NA> 
    
  • 2020-12-04 15:29

    Expanding on @user1849895's solution:

    common <- intersect(names(train), names(test)) 
    for (p in common) { 
      if (is.factor(train[[p]])) { 
        levels(test[[p]]) <- levels(train[[p]]) 
      } 
    }
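Before aligning anything, it can help to find out which columns actually differ. The sketch below is a base-R diagnostic under stated assumptions: the helper name find_mismatches and the sample data frames are made up, and it only compares column classes and factor levels, which is an approximation of whatever predict checks internally.

```r
# Hypothetical helper: names of shared columns whose class or levels differ
find_mismatches <- function(train, test) {
  common <- intersect(names(train), names(test))
  bad <- character(0)
  for (p in common) {
    same_class  <- identical(class(train[[p]]), class(test[[p]]))
    same_levels <- identical(levels(train[[p]]), levels(test[[p]]))
    if (!same_class || !same_levels) bad <- c(bad, p)
  }
  bad
}

# Made-up data: only column g has different levels
train <- data.frame(g = factor(c("a", "b", "c")), n = c(1.5, 2, 3),
                    s = factor(c("x", "y", "x")))
test  <- data.frame(g = factor(c("a", "b")), n = c(4, 5),
                    s = factor(c("x", "y")))

mismatched <- find_mismatches(train, test)
print(mismatched)
```

The returned column names can then be fixed one by one, which avoids relabeling columns that already match.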
    