Consistent factor levels for same value over different datasets

后端未结

关注

 1  1377

I\'m not sure if I completely understand how factors work. So please correct me in an easy to understand way if I\'m wrong.

I always assumed that when doing regress

相关标签:

1条回答

借酒劲吻你

2021-01-06 17:21
When you use as.factor to convert / coerce a vector into a factor, R takes all unique values of your vector and associates a numerical id to each of them; it also has a default sorting method to decide which value gets 1, 2 etc.

If you have different vectors which live in a common "universe" of values and you want to convert them into consistent factors (i.e. a value appearing in different vectors / dfs is associated to the same numerical id), do this:
```
x <- letters[1:5]
y <- letters[3:8]
allvalues <- unique(union(x,y))  # superfluous but I think it adds clarity
x <- factor(x, levels = allvalues)
y <- factor(y, levels = allvalues)
str(x)   # Factor w/ 8 levels "a","b","c","d",..: 1 2 3 4 5
str(y)   # Factor w/ 8 levels "a","b","c","d",..: 3 4 5 6 7 8
```
Edit

A small experiment to show that R is smart enough to recognize factor values in different vectors, even if they had been assigned inconsistent numerical ids:
```
y <- sample(1:2, size = 20, replace = T)
x <- factor(letters[y], levels = c("b","a"))  # so a~2 and b~1
y <- y + rnorm(0, 0.2, n = 20)
Set <- data.frame(x = x, y = y)
fit <- lm(data = Set, y ~ x)
```
To get descriptions of everything: str(x), str(y), summary(fit).

So fit is trained to associate x = a (which as a factor has a numerical tag of 2) with the value y ~= 1 and y = b with the value x ~= 2.

Now let's make a "confusing" test set:
```
x2 <- factor(c("a","b"), levels = c("c","d","a","b"))
str(x2)   # Factor w/ 4 levels "c","d","a","b": 3 4
```
Let's use predict to see what R makes of it:
```
predict(fit, newdata = data.frame(x = x2))
#        1        2 
# 1.060569 1.961109 
```
Which is what we'd expect from R...
0 讨论(0)
发布评论:

提交评论
- 加载中...