How to debug “contrasts can be applied only to factors with 2 or more levels” error?

后悔当初 2020-11-21 23:32

Here are all the variables I'm working with:

str(ad.train)
$ Date                : Factor w/ 427 levels "2012-03-24","2012-03-29",..: 4 7 12 14 19 21 24


        
3 answers
  • 2020-11-21 23:56

    From my experience ten minutes ago, this error can occur when a factor has more than one level but its column also contains a lot of NAs. Taking the Kaggle House Prices dataset as an example, if you load the data and run a simple regression,

    train.df = read.csv('train.csv')
    lm1 = lm(SalePrice ~ ., data = train.df)
    

    you will get the same error. I also tried checking the number of levels of each factor, but none of them reports fewer than 2 levels.

    cols = colnames(train.df)
    for (col in cols){
      if(is.factor(train.df[[col]])){
        cat(col, ' has ', length(levels(train.df[[col]])), '\n')
      }
    }
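
    A possible reason the loop above finds nothing suspicious: it counts the levels declared on the raw columns, whereas lm first drops every row that contains an NA and then drops unused levels. The sketch below illustrates this on a made-up data frame (toy, with hypothetical columns price, area, pool and zone), not on the Kaggle data.

    ```r
    ## made-up data: `pool` is mostly NA, `zone` has two declared levels
    toy <- data.frame(
      price = c(10, 12, 15, 11, 14, 13),
      area  = c(1, 2, 3, 4, 5, 6),
      pool  = factor(c(NA, "good", NA, "fair", NA, NA)),
      zone  = factor(c("A", "A", "B", "A", "B", "B"))
    )

    ## level counts on the raw columns: nothing looks wrong
    raw_counts <- sapply(Filter(is.factor, toy), nlevels)

    ## level counts on the rows that survive `na.omit`, unused levels dropped
    kept <- droplevels(na.omit(toy))
    used_counts <- sapply(Filter(is.factor, kept), nlevels)

    ## capture the error message instead of stopping
    fit <- tryCatch(lm(price ~ ., data = toy),
                    error = function(e) conditionMessage(e))

    print(raw_counts)
    print(used_counts)
    print(fit)
    ```

    Only the two rows with a non-missing pool value survive, and both belong to zone "A", so zone ends up with a single level in use even though it declares two.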
    

    So after a long time I used summary(train.df) to see the details of each column, removed some of them, and it finally worked:

    train.df = subset(train.df, select=-c(Id, PoolQC,Fence, MiscFeature, Alley, Utilities))
    lm1 = lm(SalePrice ~ ., data = train.df)
    

    If any one of these columns is left in, the regression fails again with the same error (I have tested this myself).

    Another way to deal with this error when there are a lot of NAs is to replace each NA with the most common value of its column. Note that the following method does not help when NA itself is the most common value of a column; I suggest dropping such columns or substituting their values manually, column by column, rather than applying a function to the whole dataset like this:

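    fill.na.with.mode = function(df){
        cols = colnames(df)
        for (col in cols){
            if(is.factor(df[[col]])){
                ## counts per level; an "NA's" entry is included when NAs exist
                x = summary(df[[col]])
                ## name of the most frequent entry (avoid shadowing base::mode)
                most.common = names(x[which.max(x)])
                df[[col]][is.na(df[[col]])] = most.common
            }
            else{
                ## all other columns: NA is replaced with 0
                df[[col]][is.na(df[[col]])] = 0
            }
        }
        return (df)
    }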
    

    The attributes removed above generally have 1400+ NAs and only about 10 useful values, so you might want to remove such garbage attributes even if they have 3 or 4 levels. I guess a function counting how many NAs there are in each column will help.
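
    Such a counting function could look roughly like the sketch below. The name count_na, the toy data and the 0.4 threshold are all made up for illustration.

    ```r
    ## count NAs per column and report the share of missing values
    count_na <- function(df) {
      n_na <- colSums(is.na(df))
      data.frame(column = names(n_na),
                 n_na   = as.integer(n_na),
                 share  = as.numeric(n_na) / nrow(df),
                 row.names = NULL,
                 stringsAsFactors = FALSE)
    }

    toy <- data.frame(a = c(1, NA, 3, NA),
                      b = c("x", "y", NA, "z"),
                      c = 1:4,
                      stringsAsFactors = FALSE)

    na_report <- count_na(toy)
    print(na_report[order(-na_report$n_na), ])

    ## columns whose NA share exceeds an arbitrary threshold
    too_sparse <- na_report$column[na_report$share > 0.4]
    print(too_sparse)
    ```

    Columns listed in too_sparse are candidates for removal before fitting; which threshold is sensible depends on the data.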

  • 2020-11-22 00:04

    Introduction

    What a "contrasts error" is has been well explained: you have a factor that has only one level (or fewer). But in reality this simple fact can be easily obscured, because the data actually used for model fitting can be very different from what you've passed in. This happens when you have NA in your data, you've subsetted your data, a factor has unused levels, or you've transformed your variables and got NaN somewhere. You are rarely in the ideal situation where a single-level factor can be spotted directly from str(your_data_frame). Many questions on StackOverflow regarding this error are not reproducible, so the suggestions people give may or may not work. Therefore, although there are by now 118 posts regarding this issue, users still can't find an adaptive solution, and the question is raised again and again. This answer is my attempt to solve this matter "once and for all", or at least to provide a reasonable guide.

    This answer has rich information, so let me first make a quick summary.

    I defined 3 helper functions for you: debug_contr_error, debug_contr_error2, NA_preproc.

    I recommend you use them in the following way.

    1. run NA_preproc to get more complete cases;
    2. run your model, and if you get a "contrasts error", use debug_contr_error2 for debugging.

    Most of the answer shows you step by step how and why these functions are defined. There is probably no harm in skipping the development process, but don't skip the sections from "Reproducible case studies and Discussions" onwards.


    Revised answer

    The original answer works perfectly for OP, and has successfully helped some others. But it has failed elsewhere for lack of adaptiveness. Look at the output of str(ad.train) in the question. OP's variables are numeric or factors; there are no characters. The original answer was for this situation. If you have character variables, although they will be coerced to factors during lm and glm fitting, they won't be reported by the code, since they were not provided as factors and is.factor will miss them. In this expansion I will make the original answer more adaptive.

    Let dat be your dataset passed to lm or glm. If you don't readily have such a data frame, that is, all your variables are scattered in the global environment, you need to gather them into a data frame. The following may not be the best way but it works.

    ## `form` is your model formula, here is an example
    y <- x1 <- x2 <- x3 <- 1:4
    x4 <- matrix(1:8, 4)
    form <- y ~ bs(x1) + poly(x2) + I(1 / x3) + x4
    
    ## to gather variables `model.frame.default(form)` is the easiest way 
    ## but it does too much: it drops `NA` and transforms variables
    ## we want something more primitive
    
    ## first get variable names
    vn <- all.vars(form)
    #[1] "y"  "x1" "x2" "x3" "x4"
    
    ## `get_all_vars(form)` gets you a data frame
    ## but it is buggy for matrix variables so don't use it
    ## instead, first use `mget` to gather variables into a list
    lst <- mget(vn)
    
    ## don't do `data.frame(lst)`; it is buggy with matrix variables
    ## need to first protect matrix variables by `I()` then do `data.frame`
    lst_protect <- lapply(lst, function (x) if (is.matrix(x)) I(x) else x)
    dat <- data.frame(lst_protect)
    str(dat)
    #'data.frame':  4 obs. of  5 variables:
    # $ y : int  1 2 3 4
    # $ x1: int  1 2 3 4
    # $ x2: int  1 2 3 4
    # $ x3: int  1 2 3 4
    # $ x4: 'AsIs' int [1:4, 1:2] 1 2 3 4 5 6 7 8
    
    ## note the 'AsIs' for matrix variable `x4`
    ## in comparison, try the following buggy ones yourself
    str(get_all_vars(form))
    str(data.frame(lst))
    

    Step 0: explicit subsetting

    If you've used the subset argument of lm or glm, start with an explicit subsetting:

    ## `subset_vec` is what you pass to `lm` via `subset` argument
    ## it can either be a logical vector of length `nrow(dat)`
    ## or a shorter positive integer vector giving position index
    ## note however, `base::subset` expects logical vector for `subset` argument
    ## so a rigorous check is necessary here
    if (mode(subset_vec) == "logical") {
      if (length(subset_vec) != nrow(dat)) {
        stop("'logical' `subset_vec` provided but length does not match `nrow(dat)`")
        }
      subset_log_vec <- subset_vec
      } else if (mode(subset_vec) == "numeric") {
      ## check range
      ran <- range(subset_vec)
      if (ran[1] < 1 || ran[2] > nrow(dat)) {
        stop("'numeric' `subset_vec` provided but values are out of bound")
        } else {
        subset_log_vec <- logical(nrow(dat))
        subset_log_vec[as.integer(subset_vec)] <- TRUE
        } 
      } else {
      stop("`subset_vec` must be either 'logical' or 'numeric'")
      }
    dat <- base::subset(dat, subset = subset_log_vec)
    

    Step 1: remove incomplete cases

    dat <- na.omit(dat)
    

    Do not skip this step even if you've gone through step 0: subset only drops rows for which the subsetting condition itself is NA; rows with NA in other columns are kept.

    Step 2: mode checking and conversion

    A data frame column is usually an atomic vector, with a mode from the following: "logical", "numeric", "complex", "character", "raw". For regression, variables of different modes are handled differently.

    "logical",   it depends
    "numeric",   nothing to do
    "complex",   not allowed by `model.matrix`, though allowed by `model.frame`
    "character", converted to "numeric" with "factor" class by `model.matrix`
    "raw",       not allowed by `model.matrix`, though allowed by `model.frame`
    

    A logical variable is tricky. It can either be treated as a dummy variable (1 for TRUE; 0 for FALSE) hence a "numeric", or it can be coerced to a two-level factor. It all depends on whether model.matrix thinks a "to-factor" coercion is necessary from the specification of your model formula. For simplicity we can understand it as such: it is always coerced to a factor, but the result of applying contrasts may end up with the same model matrix as if it were handled as a dummy directly.

    Some people may wonder why "integer" is not included. Because an integer vector, like 1:4, has a "numeric" mode (try mode(1:4)).

    A data frame column may also be a matrix with "AsIs" class, but such a matrix must have "numeric" mode.
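
    The modes mentioned above can be checked directly. The snippet below is only an illustration on small made-up vectors.

    ```r
    ## `mode` of a few typical column types
    modes <- sapply(
      list(integer   = 1:4,
           double    = c(1.5, 2),
           logical   = c(TRUE, FALSE),
           character = c("a", "b"),
           factor    = factor(c("a", "b")),
           matrix    = I(matrix(1:4, 2))),
      mode
    )
    print(modes)
    ```

    Integer, double, factor and the "AsIs" integer matrix all report "numeric", which is why only "logical" and "character" columns need an explicit conversion.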

    Our check produces an error when

    • a "complex" or "raw" is found;
    • a "logical" or "character" matrix variable is found;

    and proceed to convert "logical" and "character" to "numeric" of "factor" class.

    ## get mode of all vars
    var_mode <- sapply(dat, mode)
    
    ## produce error if complex or raw is found
    if (any(var_mode %in% c("complex", "raw"))) stop("complex or raw not allowed!")
    
    ## get class of all vars
    var_class <- sapply(dat, class)
    
    ## produce error if an "AsIs" object has "logical" or "character" mode
    if (any(var_mode[var_class == "AsIs"] %in% c("logical", "character"))) {
      stop("matrix variables with 'AsIs' class must be 'numeric'")
      }
    
    ## identify columns that need to be coerced to factors
    ind1 <- which(var_mode %in% c("logical", "character"))
    
    ## coerce logical / character to factor with `as.factor`
    dat[ind1] <- lapply(dat[ind1], as.factor)
    

    Note that if a data frame column is already a factor variable, it will not be included in ind1, as a factor variable has "numeric" mode (try mode(factor(letters[1:4]))).

    Step 3: drop unused factor levels

    We won't have unused factor levels for factor variables converted from step 2, i.e., those indexed by ind1. However, factor variables that come with dat might have unused levels (often as the result of step 0 and step 1). We need to drop any possible unused levels from them.

    ## index of factor columns
    fctr <- which(sapply(dat, is.factor))
    
    ## factor variables that have skipped explicit conversion in step 2
    ## don't simply do `ind2 <- fctr[-ind1]`; buggy if `ind1` is `integer(0)`
    ind2 <- if (length(ind1) > 0L) fctr[-ind1] else fctr
    
    ## drop unused levels
    dat[ind2] <- lapply(dat[ind2], droplevels)
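
    To see why dropping unused levels matters, here is a minimal illustration on a made-up factor: after subsetting, the declared levels stay the same even though only one of them is still present.

    ```r
    f <- factor(c("a", "a", "b", "c"))

    ## keep the first two elements only; all three levels remain declared
    kept <- f[1:2]
    n_declared <- nlevels(kept)

    ## after `droplevels` only the level actually present is left
    n_used <- nlevels(droplevels(kept))

    c(declared = n_declared, used = n_used)
    ```

    Counting levels without droplevels would report 3 here and hide the fact that the factor is effectively single-level.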
    

    Step 4: summarize factor variables

    Now we are ready to see what and how many factor levels are actually used by lm or glm:

    ## export factor levels actually used by `lm` and `glm`
    lev <- lapply(dat[fctr], levels)
    
    ## count number of levels
    nl <- lengths(lev)
    

    To make your life easier, I've wrapped up those steps into a function debug_contr_error.

    Input:

    • dat is your data frame passed to lm or glm via data argument;
    • subset_vec is the index vector passed to lm or glm via subset argument.

    Output: a list with

    • nlevels (a named vector) gives the number of factor levels for all factor variables;
    • levels (a list) gives levels for all factor variables.

    The function produces a warning, if there are no complete cases or no factor variables to summarize.

    debug_contr_error <- function (dat, subset_vec = NULL) {
      if (!is.null(subset_vec)) {
        ## step 0
        if (mode(subset_vec) == "logical") {
          if (length(subset_vec) != nrow(dat)) {
            stop("'logical' `subset_vec` provided but length does not match `nrow(dat)`")
            }
          subset_log_vec <- subset_vec
          } else if (mode(subset_vec) == "numeric") {
          ## check range
          ran <- range(subset_vec)
          if (ran[1] < 1 || ran[2] > nrow(dat)) {
            stop("'numeric' `subset_vec` provided but values are out of bound")
            } else {
            subset_log_vec <- logical(nrow(dat))
            subset_log_vec[as.integer(subset_vec)] <- TRUE
            } 
          } else {
          stop("`subset_vec` must be either 'logical' or 'numeric'")
          }
        dat <- base::subset(dat, subset = subset_log_vec)
        }
      ## step 1: always needed, `subset` keeps rows with NA in other columns
      dat <- stats::na.omit(dat)
      if (nrow(dat) == 0L) warning("no complete cases")
      ## step 2
      var_mode <- sapply(dat, mode)
      if (any(var_mode %in% c("complex", "raw"))) stop("complex or raw not allowed!")
      var_class <- sapply(dat, class)
      if (any(var_mode[var_class == "AsIs"] %in% c("logical", "character"))) {
        stop("matrix variables with 'AsIs' class must be 'numeric'")
        }
      ind1 <- which(var_mode %in% c("logical", "character"))
      dat[ind1] <- lapply(dat[ind1], as.factor)
      ## step 3
      fctr <- which(sapply(dat, is.factor))
      if (length(fctr) == 0L) warning("no factor variables to summarize")
      ind2 <- if (length(ind1) > 0L) fctr[-ind1] else fctr
      dat[ind2] <- lapply(dat[ind2], base::droplevels.factor)
      ## step 4
      lev <- lapply(dat[fctr], base::levels.default)
      nl <- lengths(lev)
      ## return
      list(nlevels = nl, levels = lev)
      }
    

    Here is a constructed tiny example.

    dat <- data.frame(y = 1:4,
                      x = c(1:3, NA),
                      f1 = gl(2, 2, labels = letters[1:2]),
                      f2 = c("A", "A", "A", "B"),
                      stringsAsFactors = FALSE)
    
    #  y  x f1 f2
    #1 1  1  a  A
    #2 2  2  a  A
    #3 3  3  b  A
    #4 4 NA  b  B
    
    str(dat)
    #'data.frame':  4 obs. of  4 variables:
    # $ y : int  1 2 3 4
    # $ x : int  1 2 3 NA
    # $ f1: Factor w/ 2 levels "a","b": 1 1 2 2
    # $ f2: chr  "A" "A" "A" "B"
    
    lm(y ~ x + f1 + f2, dat)
    #Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
    #  contrasts can be applied only to factors with 2 or more levels
    

    Good, we see an error. Now my debug_contr_error exposes that f2 ends up with a single level.

    debug_contr_error(dat)
    #$nlevels
    #f1 f2 
    # 2  1 
    #
    #$levels
    #$levels$f1
    #[1] "a" "b"
    #
    #$levels$f2
    #[1] "A"
    

    Note that the original short answer is hopeless here, as f2 is provided as a character variable not a factor variable.

    ## old answer
    tmp <- na.omit(dat)
    fctr <- lapply(tmp[sapply(tmp, is.factor)], droplevels)
    sapply(fctr, nlevels)
    #f1 
    # 2 
    rm(tmp, fctr)
    

    Now let's see an example with a matrix variable X.

    dat <- data.frame(X = I(rbind(matrix(1:6, 3), NA)),
                      f = c("a", "a", "a", "b"),
                      y = 1:4,
                      stringsAsFactors = TRUE)  ## the default before R 4.0.0
    
    dat
    #  X.1 X.2 f y
    #1   1   4 a 1
    #2   2   5 a 2
    #3   3   6 a 3
    #4  NA  NA b 4
    
    str(dat)
    #'data.frame':  4 obs. of  3 variables:
    # $ X: 'AsIs' int [1:4, 1:2] 1 2 3 NA 4 5 6 NA
    # $ f: Factor w/ 2 levels "a","b": 1 1 1 2
    # $ y: int  1 2 3 4
    
    lm(y ~ X + f, data = dat)
    #Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
    #  contrasts can be applied only to factors with 2 or more levels
    
    debug_contr_error(dat)$nlevels
    #f 
    #1
    

    Note that a factor variable with no levels can cause a "contrasts error", too. You may wonder how a 0-level factor is possible. Well, it is legitimate: nlevels(factor(character(0))). Here you will end up with 0-level factors if you have no complete cases.

    dat <- data.frame(y = 1:4,
                      x = rep(NA_real_, 4),
                      f1 = gl(2, 2, labels = letters[1:2]),
                      f2 = c("A", "A", "A", "B"),
                      stringsAsFactors = FALSE)
    
    lm(y ~ x + f1 + f2, dat)
    #Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
    #  contrasts can be applied only to factors with 2 or more levels
    
    debug_contr_error(dat)$nlevels
    #f1 f2 
    # 0  0    ## all values are 0
    #Warning message:
    #In debug_contr_error(dat) : no complete cases
    

    Finally let's see a situation where f2 is a logical variable.

    dat <- data.frame(y = 1:4,
                      x = c(1:3, NA),
                      f1 = gl(2, 2, labels = letters[1:2]),
                      f2 = c(TRUE, TRUE, TRUE, FALSE))
    
    dat
    #  y  x f1    f2
    #1 1  1  a  TRUE
    #2 2  2  a  TRUE
    #3 3  3  b  TRUE
    #4 4 NA  b FALSE
    
    str(dat)
    #'data.frame':  4 obs. of  4 variables:
    # $ y : int  1 2 3 4
    # $ x : int  1 2 3 NA
    # $ f1: Factor w/ 2 levels "a","b": 1 1 2 2
    # $ f2: logi  TRUE TRUE TRUE FALSE
    

    Our debugger will predict a "contrasts error", but will it really happen?

    debug_contr_error(dat)$nlevels
    #f1 f2 
    # 2  1 
    

    No, at least this one does not fail (the NA coefficient is due to the rank-deficiency of the model; don't worry):

    lm(y ~ x + f1 + f2, data = dat)
    #Coefficients:
    #(Intercept)            x          f1b       f2TRUE  
    #          0            1            0           NA
    

    It is difficult for me to come up with an example giving an error, but there is also no need. In practice, we don't use the debugger for prediction; we use it when we really get an error; and in that case, the debugger can locate the offending factor variable.

    Perhaps some may argue that a logical variable is no different from a dummy. But try the simple example below: it does depend on your formula.

    u <- c(TRUE, TRUE, FALSE, FALSE)
    v <- c(1, 1, 0, 0)  ## "numeric" dummy of `u`
    
    model.matrix(~ u)
    #  (Intercept) uTRUE
    #1           1     1
    #2           1     1
    #3           1     0
    #4           1     0
    
    model.matrix(~ v)
    #  (Intercept) v
    #1           1 1
    #2           1 1
    #3           1 0
    #4           1 0
    
    model.matrix(~ u - 1)
    #  uFALSE uTRUE
    #1      0     1
    #2      0     1
    #3      1     0
    #4      1     0
    
    model.matrix(~ v - 1)
    #  v
    #1 1
    #2 1
    #3 0
    #4 0
    

    More flexible implementation using "model.frame" method of lm

    You are also advised to go through R: how to debug "factor has new levels" error for linear model and prediction, which explains what lm and glm do under the hood on your dataset. You will understand that steps 0 to 4 listed above are just trying to mimic such internal process. Remember, the data that are actually used for model fitting can be very different from what you've passed in.

    Our steps are not completely consistent with such internal processing. For a comparison, you can retrieve the result of the internal processing by using method = "model.frame" in lm and glm. Try this on the previously constructed tiny example dat where f2 is a character variable.

    dat_internal <- lm(y ~ x + f1 + f2, dat, method = "model.frame")
    
    dat_internal
    #  y x f1 f2
    #1 1 1  a  A
    #2 2 2  a  A
    #3 3 3  b  A
    
    str(dat_internal)
    #'data.frame':  3 obs. of  4 variables:
    # $ y : int  1 2 3
    # $ x : int  1 2 3
    # $ f1: Factor w/ 2 levels "a","b": 1 1 2
    # $ f2: chr  "A" "A" "A"
    ## [.."terms" attribute is truncated..]
    

    In practice, model.frame will only perform step 0 and step 1. It also drops variables provided in your dataset but not in your model formula. So a model frame may have both fewer rows and fewer columns than what you feed lm and glm. Type coercion as in our step 2 is done later by model.matrix, where a "contrasts error" may be produced.
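
    The division of labour between model.frame and model.matrix can be observed on a small made-up data frame (toy below is an assumption for illustration; the error is captured with tryCatch so the snippet runs to the end).

    ```r
    toy <- data.frame(y = 1:4,
                      x = c(1:3, NA),
                      f2 = c("A", "A", "A", "B"),
                      stringsAsFactors = FALSE)

    ## building the model frame succeeds: only NA handling happens here
    mf <- model.frame(y ~ x + f2, data = toy)
    str(mf$f2)   ## still a character vector, with 3 rows left

    ## coercion to factor and contrasts are applied by `model.matrix`
    msg <- tryCatch(model.matrix(attr(mf, "terms"), mf),
                    error = function(e) conditionMessage(e))
    print(msg)
    ```

    The model frame is built without complaint; the failure only appears once model.matrix turns f2 into a factor and tries to set contrasts on its single remaining level.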

    There are a few advantages to first get this internal model frame, then pass it to debug_contr_error (so that it only essentially performs steps 2 to 4).

    advantage 1: variables not used in your model formula are ignored

    ## no variable `f1` in formula
    dat_internal <- lm(y ~ x + f2, dat, method = "model.frame")
    
    ## compare the following
    debug_contr_error(dat)$nlevels
    #f1 f2 
    # 2  1 
    
    debug_contr_error(dat_internal)$nlevels
    #f2 
    # 1 
    

    advantage 2: able to cope with transformed variables

    It is valid to transform variables in the model formula, and model.frame will record the transformed ones instead of the original ones. Note that, even if your original variable has no NA, the transformed one can have some.

    dat <- data.frame(y = 1:4, x = c(1:3, -1), f = rep(letters[1:2], c(3, 1)))
    #  y  x f
    #1 1  1 a
    #2 2  2 a
    #3 3  3 a
    #4 4 -1 b
    
    lm(y ~ log(x) + f, data = dat)
    #Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
    #  contrasts can be applied only to factors with 2 or more levels
    #In addition: Warning message:
    #In log(x) : NaNs produced
    
    # directly using `debug_contr_error` is hopeless here
    debug_contr_error(dat)$nlevels
    #f 
    #2 
    
    ## this works
    dat_internal <- lm(y ~ log(x) + f, data = dat, method = "model.frame")
    #  y    log(x) f
    #1 1 0.0000000 a
    #2 2 0.6931472 a
    #3 3 1.0986123 a
    
    debug_contr_error(dat_internal)$nlevels
    #f 
    #1
    

    Given these benefits, I write another function wrapping up model.frame and debug_contr_error.

    Input:

    • form is your model formula;
    • dat is the dataset passed to lm or glm via data argument;
    • subset_vec is the index vector passed to lm or glm via subset argument.

    Output: a list with

    • mf (a data frame) gives the model frame (with "terms" attribute dropped);
    • nlevels (a named vector) gives the number of factor levels for all factor variables;
    • levels (a list) gives levels for all factor variables.

    ## note: this function relies on `debug_contr_error`
    debug_contr_error2 <- function (form, dat, subset_vec = NULL) {
      ## step 0
      if (!is.null(subset_vec)) {
        if (mode(subset_vec) == "logical") {
          if (length(subset_vec) != nrow(dat)) {
            stop("'logical' `subset_vec` provided but length does not match `nrow(dat)`")
            }
          subset_log_vec <- subset_vec
          } else if (mode(subset_vec) == "numeric") {
          ## check range
          ran <- range(subset_vec)
          if (ran[1] < 1 || ran[2] > nrow(dat)) {
            stop("'numeric' `subset_vec` provided but values are out of bound")
            } else {
            subset_log_vec <- logical(nrow(dat))
            subset_log_vec[as.integer(subset_vec)] <- TRUE
            } 
          } else {
          stop("`subset_vec` must be either 'logical' or 'numeric'")
          }
        dat <- base::subset(dat, subset = subset_log_vec)
        }
      ## step 0 and 1
      dat_internal <- stats::lm(form, data = dat, method = "model.frame")
      attr(dat_internal, "terms") <- NULL
      ## rely on `debug_contr_error` for steps 2 to 4
      c(list(mf = dat_internal), debug_contr_error(dat_internal, NULL))
      }
    

    Try the previous log transform example.

    debug_contr_error2(y ~ log(x) + f, dat)
    #$mf
    #  y    log(x) f
    #1 1 0.0000000 a
    #2 2 0.6931472 a
    #3 3 1.0986123 a
    #
    #$nlevels
    #f 
    #1 
    #
    #$levels
    #$levels$f
    #[1] "a"
    #
    #
    #Warning message:
    #In log(x) : NaNs produced
    

    Try subset_vec as well.

    ## or: debug_contr_error2(y ~ log(x) + f, dat, c(TRUE, FALSE, TRUE, TRUE))
    debug_contr_error2(y ~ log(x) + f, dat, c(1,3,4))
    #$mf
    #  y   log(x) f
    #1 1 0.000000 a
    #3 3 1.098612 a
    #
    #$nlevels
    #f 
    #1 
    #
    #$levels
    #$levels$f
    #[1] "a"
    #
    #
    #Warning message:
    #In log(x) : NaNs produced
    

    Model fitting per group and NA as factor levels

    If you are fitting models by group, you are more likely to get a "contrasts error". You need to

    1. split your data frame by the grouping variable (see ?split.data.frame);
    2. work through those data frames one by one, applying debug_contr_error2 (lapply function can be helpful to do this loop).
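
    The two steps above can be sketched as follows. To keep the snippet self-contained it does not call debug_contr_error2; instead it uses a small stand-in helper, used_levels (a hypothetical name), that only counts the distinct values each categorical variable takes in the model frame. The data frame toy and its columns are made up.

    ```r
    ## hypothetical stand-in helper: number of distinct values used by each
    ## factor / character / logical variable in the model frame
    used_levels <- function(form, dat) {
      mf <- model.frame(form, data = dat, drop.unused.levels = TRUE)
      is_cat <- vapply(mf, function(v) {
        is.factor(v) || is.character(v) || is.logical(v)
      }, logical(1))
      vapply(mf[is_cat], function(v) length(unique(v)), integer(1))
    }

    toy <- data.frame(
      y = c(1, 2, 3, 4, 5, 6),
      f = c("a", "b", "a", "c", "c", "c"),
      g = c("g1", "g1", "g1", "g2", "g2", "g2"),
      stringsAsFactors = FALSE
    )

    ## step 1: split by the grouping variable; step 2: inspect each piece
    by_group <- lapply(split(toy, toy$g), used_levels, form = y ~ f)
    print(by_group)
    ```

    Group g2 contains only the value "c" for f, so a model y ~ f fitted on that group alone would hit the "contrasts error".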

    Some also told me that they cannot use na.omit on their data, because they would end up with too few rows to do anything sensible. This can be relaxed. In practice it is the NA_integer_ and NA_real_ that have to be omitted, but NA_character_ can be retained: just add NA as a factor level. To achieve this, you need to loop through variables in your data frame:

    • if a variable x is already a factor and anyNA(x) is TRUE, do x <- addNA(x). The "and" is important. If x has no NA, addNA(x) will add an unused <NA> level.
    • if a variable x is a character, do x <- factor(x, exclude = NULL) to coerce it to a factor. exclude = NULL will retain <NA> as a level.
    • if x is "logical", "numeric", "raw" or "complex", nothing should be changed. NA is just NA.

    <NA> factor level will not be dropped by droplevels or na.omit, and it is valid for building a model matrix. Check the following examples.

    ## x is a factor with NA
    
    x <- factor(c(letters[1:4], NA))  ## default: `exclude = NA`
    #[1] a    b    c    d    <NA>     ## there is an NA value
    #Levels: a b c d                  ## but NA is not a level
    
    na.omit(x)  ## NA is gone
    #[1] a b c d
    #[.. attributes truncated..]
    #Levels: a b c d
    
    x <- addNA(x)  ## now add NA into a valid level
    #[1] a    b    c    d    <NA>
    #Levels: a b c d <NA>  ## it appears here
    
    droplevels(x)    ## it can not be dropped
    #[1] a    b    c    d    <NA>
    #Levels: a b c d <NA>
    
    na.omit(x)  ## it is not omitted
    #[1] a    b    c    d    <NA>
    #Levels: a b c d <NA>
    
    model.matrix(~ x)   ## and it is valid to be in a design matrix
    #  (Intercept) xb xc xd xNA
    #1           1  0  0  0   0
    #2           1  1  0  0   0
    #3           1  0  1  0   0
    #4           1  0  0  1   0
    #5           1  0  0  0   1
    

    ## x is a character with NA
    
    x <- c(letters[1:4], NA)
    #[1] "a" "b" "c" "d" NA 
    
    as.factor(x)  ## this calls `factor(x)` with default `exclude = NA`
    #[1] a    b    c    d    <NA>     ## there is an NA value
    #Levels: a b c d                  ## but NA is not a level
    
    factor(x, exclude = NULL)      ## we want `exclude = NULL`
    #[1] a    b    c    d    <NA>
    #Levels: a b c d <NA>          ## now NA is a level
    

    Once you add NA as a level in a factor / character, your dataset might suddenly have more complete cases. Then you can run your model. If you still get a "contrasts error", use debug_contr_error2 to see what has happened.

    For your convenience, I write a function for this NA preprocessing.

    Input:

    • dat is your full dataset.

    Output:

    • a data frame, with NA added as a level for factor / character.

    NA_preproc <- function (dat) {
      for (j in seq_len(ncol(dat))) {
        x <- dat[[j]]
        if (is.factor(x) && anyNA(x)) dat[[j]] <- base::addNA(x)
        if (is.character(x)) dat[[j]] <- factor(x, exclude = NULL)
        }
      dat
      }
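
    The effect on the number of complete cases can be checked on a small made-up data frame. So that the snippet runs on its own, it uses a compact lapply variant of the same idea under a hypothetical name, na_as_level, rather than NA_preproc itself.

    ```r
    toy <- data.frame(y = 1:4,
                      x = c(1, 2, 3, 4),
                      f = c("a", NA, "b", NA),
                      stringsAsFactors = FALSE)

    ## compact variant: characters become factors with NA kept as a level,
    ## factors get an NA level only if they actually contain NA
    na_as_level <- function(dat) {
      dat[] <- lapply(dat, function(x) {
        if (is.character(x)) factor(x, exclude = NULL)
        else if (is.factor(x) && anyNA(x)) addNA(x)
        else x
      })
      dat
    }

    before <- sum(complete.cases(toy))
    after  <- sum(complete.cases(na_as_level(toy)))
    c(before = before, after = after)
    ```

    Once NA is a level of f, the two rows that were previously incomplete count as complete cases.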
    

    Reproducible case studies and Discussions

    The following threads were specially selected as reproducible case studies, as I just answered them with the three helper functions created here.

    • How to do a GLM when "contrasts can be applied only to factors with 2 or more levels"?
    • R: Error in contrasts when fitting linear models with `lm`

    There are also a few other good-quality threads solved by other StackOverflow users:

    • Factors not being recognised in a lm using map() (this is about model fitting by group)
    • How to drop NA observation of factors conditionally when doing linear regression in R? (this is similar to case 1 in the previous list)
    • Factor/level error in mixed model (another post about model fitting by group)

    This answer aims to debug the "contrasts error" during model fitting. However, this error can also turn up when using predict for prediction. Such behavior does not come from predict.lm or predict.glm, but from predict methods in some packages. Here are a few related threads on StackOverflow.

    • Prediction in R - GLMM
    • Error in `contrasts' Error
    • SVM predict on dataframe with different factor levels
    • Using predict with svyglm
    • must a dataset contain all factors in SVM in R
    • Probability predictions with cumulative link mixed models

    Also note that the philosophy of this answer is based on that of lm and glm. These two functions set a coding standard for many model fitting routines, but maybe not all model fitting routines behave similarly. For example, for the following threads it is not clear to me whether my helper functions would actually be helpful.

    • Error with svychisq - 'contrast can be applied to factors with 2 or more levels'
    • R packages effects & plm : "error in contrasts" when trying to plot marginal effects
    • Contrasts can be applied only to factor
    • R: lawstat::levene.test fails while Fligner Killeen works, as well as car::leveneTest
    • R - geeglm Error: contrasts can be applied only to factors with 2 or more levels

    Although a bit off-topic, it is still useful to know that sometimes a "contrasts error" merely comes from writing a wrong piece of code. In the following examples, the OPs passed the names of their variables rather than their values to lm. Since a name is a single-value character, it is later coerced to a single-level factor and causes the error.

    • Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels
    • Loop through a character vector to use in a function
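
    When looping over variable names, one way to avoid this kind of mistake is to build the formula from the name, for example with reformulate from the stats package. This is a sketch on made-up data (toy, x1, x2 are assumptions), not the code from the linked threads.

    ```r
    toy <- data.frame(y  = c(1.1, 2.3, 2.9, 4.2),
                      x1 = c(1, 2, 3, 4),
                      x2 = c(2, 1, 4, 3))

    fits <- lapply(c("x1", "x2"), function(v) {
      ## `v` is a name such as "x1"; turn it into the formula y ~ x1
      form <- reformulate(v, response = "y")
      lm(form, data = toy)
    })

    ## each model uses the column the name refers to
    coef_names <- lapply(fits, function(m) names(coef(m)))
    print(coef_names)
    ```

    Each fit then refers to the actual column in toy instead of treating the name itself as data.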

    How to resolve this error after debugging?

    In practice people want to know how to resolve this matter, either at a statistical level or a programming level.

    If you are fitting models on your complete dataset, then there is probably no statistical solution, unless you can impute missing values or collect more data. Thus you may simply turn to a coding solution to drop the offending variable. debug_contr_error2 returns nlevels which helps you easily locate them. If you don't want to drop them, replace them by a vector of 1 (as explained in How to do a GLM when "contrasts can be applied only to factors with 2 or more levels"?) and let lm or glm deal with the resulting rank-deficiency.
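
    Dropping the offending variables can be sketched as below. The snippet is self-contained, so it counts the levels in use directly from the model frame instead of calling debug_contr_error2. It assumes the offending variables appear in the formula as plain names; toy and its columns are made up.

    ```r
    toy <- data.frame(y = c(1, 2, 4, 3),
                      x = c(1, 2, 3, NA),
                      f = c("a", "a", "a", "b"),
                      stringsAsFactors = FALSE)
    form <- y ~ x + f

    ## count distinct values per categorical variable in the model frame
    mf <- model.frame(form, data = toy)
    is_cat <- vapply(mf, function(v) is.factor(v) || is.character(v), logical(1))
    n_used <- vapply(mf[is_cat], function(v) length(unique(v)), integer(1))

    ## variables with fewer than 2 levels in use
    bad <- names(n_used)[n_used < 2L]

    ## remove them from the formula and refit
    form2 <- if (length(bad) > 0L) {
      update(form, paste(". ~ . -", paste(bad, collapse = " - ")))
    } else {
      form
    }
    fit <- lm(form2, data = toy)
    print(form2)
    ```

    Here f is reduced to the single level "a" once the row with a missing x is dropped, so it is removed and the model is fitted as y ~ x.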

    If you are fitting models on subset, there can be statistical solutions.

    Fitting models by group does not necessarily require you splitting your dataset by group and fitting independent models. The following may give you a rough idea:

    • R regression analysis: analyzing data for a certain ethnicity
    • Finding the slope for multiple points in selected columns
    • R: build separate models for each category

    If you do split your data explicitly, you can easily get a "contrasts error", and thus have to adjust your model formula per group (that is, you need to dynamically generate model formulae). A simpler solution is to skip building a model for the offending group.
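
    Skipping a group could be done roughly as follows, using tryCatch around each fit. This is a sketch on made-up data; note that it skips a group on any error, not only on a "contrasts error".

    ```r
    toy <- data.frame(
      y = c(1, 2, 3, 4, 5, 7),
      f = c("a", "b", "a", "c", "c", "c"),
      g = c("g1", "g1", "g1", "g2", "g2", "g2"),
      stringsAsFactors = FALSE
    )

    ## fit one model per group; return NULL when the fit fails
    fits <- lapply(split(toy, toy$g), function(d) {
      tryCatch(lm(y ~ f, data = d), error = function(e) NULL)
    })

    ## keep only the groups where a model could be built
    fits <- Filter(Negate(is.null), fits)
    print(names(fits))
    ```

    Group g2 has a single value of f and is therefore skipped, while the model for g1 is kept.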

    You may also randomly partition your dataset into a training subset and a testing subset so that you can do cross-validation. R: how to debug "factor has new levels" error for linear model and prediction briefly mentions this, and it is better to do stratified sampling to ensure the success of both model estimation on the training part and prediction on the testing part.
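
    A minimal sketch of stratified sampling on a single factor is given below. The 70% share, the seed and the data frame toy are arbitrary choices for illustration.

    ```r
    set.seed(1)  ## only to make this illustration reproducible
    toy <- data.frame(y = rnorm(12),
                      f = rep(c("a", "b", "c"), times = c(6, 4, 2)),
                      stringsAsFactors = FALSE)

    ## sample about 70% of the rows within each level of `f`,
    ## keeping at least one row per level in the training part
    idx_by_level <- split(seq_len(nrow(toy)), toy$f)
    train_idx <- unlist(lapply(idx_by_level, function(i) {
      n_take <- max(1L, floor(0.7 * length(i)))
      i[sample.int(length(i), n_take)]
    }), use.names = FALSE)

    train <- toy[train_idx, ]
    test  <- toy[-train_idx, ]

    table(train$f)
    table(test$f)
    ```

    Every level of f appears in the training part, so no level in the testing part is new to a model fitted on train.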

  • 2020-11-22 00:09

    Perhaps a very quick first step is to verify that your factors do indeed have at least 2 levels. The quick way I found was:

    df %>% dplyr::mutate_all(as.factor) %>% str
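
    If dplyr is not at hand, a base R check along the same lines might look like this (toy_df is a made-up data frame). It counts the levels each column would have if treated as a factor.

    ```r
    toy_df <- data.frame(a = c("x", "x", "x"),
                         b = c(1, 2, 2),
                         c = c(TRUE, FALSE, NA),
                         stringsAsFactors = FALSE)

    ## number of distinct non-NA values per column, as if each were a factor
    n_lev <- sapply(toy_df, function(col) nlevels(as.factor(col)))
    print(n_lev)

    ## columns that would be single-level (or empty) factors
    single <- names(n_lev)[n_lev < 2L]
    print(single)
    ```

    Like the one-liner above, this looks at the raw columns, so it does not account for rows that the model later drops because of NA.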
    