I'm not sure if I completely understand how factors work, so please correct me in an easy-to-understand way if I'm wrong.
I always assumed that when doing regress
When you use as.factor to convert / coerce a vector into a factor, R takes all unique values of your vector and associates a numerical id with each of them; by default it sorts the unique values to decide which value gets 1, 2, etc.
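As a small illustration of that mapping (the vector v and its values are made up for this sketch), you can look at the levels and the underlying integer codes directly:

```r
# Hypothetical example vector; any character vector would do.
v <- c("pear", "apple", "pear", "fig")
f <- as.factor(v)

levels(f)      # the sorted unique values: "apple" "fig" "pear"
as.integer(f)  # the id of each element:   3 1 3 2
```

The ids are simply positions in levels(f), so changing the levels changes the ids without changing the labels.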
If you have different vectors which live in a common "universe" of values and you want to convert them into consistent factors (i.e. a value appearing in different vectors / data frames is associated with the same numerical id), do this:
x <- letters[1:5]
y <- letters[3:8]
allvalues <- unique(union(x,y)) # superfluous but I think it adds clarity
x <- factor(x, levels = allvalues)
y <- factor(y, levels = allvalues)
str(x) # Factor w/ 8 levels "a","b","c","d",..: 1 2 3 4 5
str(y) # Factor w/ 8 levels "a","b","c","d",..: 3 4 5 6 7 8
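To see why the shared levels matter, here is a sketch comparing the id of "c" with and without them (the names fx, fy, fx_shared and fy_shared are only for this illustration):

```r
x <- letters[1:5]
y <- letters[3:8]

# Separate conversions: each factor builds its own levels.
fx <- factor(x)
fy <- factor(y)
as.integer(fx)[x == "c"]  # 3, because "c" is the third level of fx
as.integer(fy)[y == "c"]  # 1, because "c" is the first level of fy

# Shared levels: "c" gets the same id in both factors.
allvalues <- union(x, y)
fx_shared <- factor(x, levels = allvalues)
fy_shared <- factor(y, levels = allvalues)
as.integer(fx_shared)[x == "c"]  # 3
as.integer(fy_shared)[y == "c"]  # 3
```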
Edit
A small experiment to show that R is smart enough to recognize factor values in different vectors, even if they had been assigned inconsistent numerical ids:
y <- sample(1:2, size = 20, replace = TRUE)
x <- factor(letters[y], levels = c("b","a")) # so a~2 and b~1
y <- y + rnorm(n = 20, mean = 0, sd = 0.2) # add a little noise around 1 and 2
Set <- data.frame(x = x, y = y)
fit <- lm(y ~ x, data = Set)
To get descriptions of everything: str(x), str(y), summary(fit).
So fit is trained to associate x = a (which as a factor has a numerical tag of 2) with the value y ~= 1, and x = b (numerical tag 1) with the value y ~= 2.
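One way to check what the model learned is to look at its coefficients. The sketch below uses a small noise-free-ish training set (train and its values are made up so the result is reproducible); with R's default treatment contrasts, the first level ("b") becomes the baseline:

```r
# Hypothetical training data: "a" averages 1, "b" averages 2.
train <- data.frame(
  x = factor(c("a", "a", "b", "b"), levels = c("b", "a")),
  y = c(0.9, 1.1, 1.9, 2.1)
)
fit_small <- lm(y ~ x, data = train)

# The intercept is the mean of y for the baseline level "b";
# the coefficient named "xa" is the shift from "b" to "a".
coef(fit_small)
```

Here the intercept comes out at about 2 and xa at about -1, i.e. the coefficients are named after level labels, not after the integer tags.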
Now let's make a "confusing" test set:
x2 <- factor(c("a","b"), levels = c("c","d","a","b"))
str(x2) # Factor w/ 4 levels "c","d","a","b": 3 4
Let's use predict to see what R makes of it:
predict(fit, newdata = data.frame(x = x2))
# 1 2
# 1.060569 1.961109
Which is what we'd expect from R: the predictions follow the level labels ("a" and "b"), not the underlying numerical ids.
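The same check can be made reproducible without random noise. This is only a sketch with made-up data (train, new_same, new_diff and new_bad are hypothetical names): it compares predictions for the same labels under two different level orderings, and also shows what happens when a label never seen in training actually occurs.

```r
train <- data.frame(
  x = factor(c("a", "a", "b", "b"), levels = c("b", "a")),
  y = c(0.9, 1.1, 1.9, 2.1)
)
fit_small <- lm(y ~ x, data = train)

# Same labels, different level sets (and therefore different integer ids).
new_same <- data.frame(x = factor(c("a", "b"), levels = c("b", "a")))
new_diff <- data.frame(x = factor(c("a", "b"), levels = c("c", "d", "a", "b")))

p_same <- predict(fit_small, newdata = new_same)
p_diff <- predict(fit_small, newdata = new_diff)
all.equal(p_same, p_diff)  # TRUE: matching is done by label

# A label that was not in the training data is an error, not a silent guess.
new_bad <- data.frame(x = factor(c("a", "c")))
res <- tryCatch(predict(fit_small, newdata = new_bad), error = function(e) e)
inherits(res, "error")  # TRUE
```

Unused extra levels such as "c" and "d" in new_diff are harmless; only a level that actually appears in the new data and was absent at training time causes the error.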