Consistent factor levels for same value over different datasets

孤独总比滥情好 2021-01-06 16:34

I'm not sure if I completely understand how factors work. So please correct me in an easy to understand way if I'm wrong.

I always assumed that when doing regress

1 answer
  • 2021-01-06 17:21

    When you use as.factor to convert (coerce) a vector into a factor, R takes all unique values of the vector and assigns an integer code to each of them; by default the unique values are sorted, and that sort order decides which value gets 1, 2, etc.
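
As a small illustration of that default coding (the vector `v` is just an example, not from the question), the levels and the integer codes can be inspected separately:

```r
# Hypothetical example vector; any character vector behaves the same way.
v <- c("b", "a", "c", "a")
f <- as.factor(v)

levels(f)      # "a" "b" "c"  (sorted unique values)
as.integer(f)  # 2 1 3 1      (position of each element's level)
```

The labels live in `levels(f)`; the integers are only indices into that vector, which is why the same label can end up with different codes in different factors.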

    If you have different vectors which live in a common "universe" of values and you want to convert them into consistent factors (i.e. a value appearing in different vectors / data frames is associated with the same integer code), do this:

    x <- letters[1:5]
    y <- letters[3:8]
    allvalues <- unique(union(x, y))  # unique() is superfluous (union() already de-duplicates), but I think it adds clarity
    x <- factor(x, levels = allvalues)
    y <- factor(y, levels = allvalues)
    str(x)   # Factor w/ 8 levels "a","b","c","d",..: 1 2 3 4 5
    str(y)   # Factor w/ 8 levels "a","b","c","d",..: 3 4 5 6 7 8
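
To see why the shared `levels` argument matters, here is a minimal sketch comparing independent conversion with conversion against a common level set. The names `x_raw`, `y_raw` and `all_levels` are chosen for illustration only:

```r
x_raw <- letters[1:5]   # "a" .. "e"
y_raw <- letters[3:8]   # "c" .. "h"

# Converted independently, "c" gets a different code in each factor.
as.integer(as.factor(x_raw))[x_raw == "c"]  # 3
as.integer(as.factor(y_raw))[y_raw == "c"]  # 1

# Converted against one shared level set, the code is the same in both.
all_levels <- union(x_raw, y_raw)
x_f <- factor(x_raw, levels = all_levels)
y_f <- factor(y_raw, levels = all_levels)
as.integer(x_f)[x_raw == "c"]  # 3
as.integer(y_f)[y_raw == "c"]  # 3
```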
    

    Edit

    A small experiment to show that R is smart enough to recognize factor values in different vectors, even if they had been assigned inconsistent numerical ids:

    y <- sample(1:2, size = 20, replace = TRUE)
    x <- factor(letters[y], levels = c("b", "a"))  # so a~2 and b~1
    y <- y + rnorm(n = 20, mean = 0, sd = 0.2)     # add a little noise
    Set <- data.frame(x = x, y = y)
    fit <- lm(y ~ x, data = Set)
    

    To get descriptions of everything: str(x), str(y), summary(fit).

    So fit is trained to associate x = a (which as a factor has a numerical tag of 2) with the value y ~= 1 and x = b (numerical tag 1) with the value y ~= 2.

    Now let's make a "confusing" test set:

    x2 <- factor(c("a","b"), levels = c("c","d","a","b"))
    str(x2)   # Factor w/ 4 levels "c","d","a","b": 3 4
    

    Let's use predict to see what R makes of it:

    predict(fit, newdata = data.frame(x = x2))
    #        1        2 
    # 1.060569 1.961109 
    

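    Which is what we'd expect from R: predict matches factor values by their level labels, not by their integer codes, so "a" still maps to about 1 and "b" to about 2 even though their codes in x2 are 3 and 4.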
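
The same experiment can be turned into a reproducible check. This is only a sketch: the seed, the data frame names (`train`, `new_same`, `new_shuffled`) and the `tryCatch` wrapper are my own additions, not part of the experiment above.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
g <- sample(1:2, size = 20, replace = TRUE)
train <- data.frame(
  x = factor(letters[g], levels = c("b", "a")),
  y = g + rnorm(n = 20, mean = 0, sd = 0.2)
)
fit <- lm(y ~ x, data = train)

# Same labels, two different level orderings (and two unused extra levels).
new_same     <- data.frame(x = factor(c("a", "b"), levels = c("a", "b")))
new_shuffled <- data.frame(x = factor(c("a", "b"), levels = c("c", "d", "a", "b")))

p_same     <- predict(fit, newdata = new_same)
p_shuffled <- predict(fit, newdata = new_shuffled)
all.equal(p_same, p_shuffled)  # TRUE: predictions depend on labels, not codes

# A label the model never saw is a different story: predict() stops with an error.
unseen <- tryCatch(
  predict(fit, newdata = data.frame(x = factor("c"))),
  error = function(e) conditionMessage(e)
)
unseen  # the error message, as a string
```

So inconsistent integer codes are harmless for predict, but a value that is actually used in the new data and was absent from the training data is not.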
