I'm quite confused about when to use
factor(educ) or factor(agegroup) in R. Is it used for categorical ordered data? or can I just use to i
You can flag a factor as ordered by creating it with ordered(x) or with factor(x, ordered=TRUE). The "Details" section of ?factor explains that:
Ordered factors differ from factors only in their class, but methods and the model-fitting functions treat the two classes quite differently.
You can confirm the first part of that quote (that they differ only in their class) by comparing the attributes of these two objects:
f <- factor(letters[3:1], levels=letters[3:1])
of <- ordered(letters[3:1], levels=letters[3:1])
attributes(f)
# $levels
# [1] "c" "b" "a"
#
# $class
# [1] "factor"
attributes(of)
# $levels
# [1] "c" "b" "a"
#
# $class
# [1] "ordered" "factor"
Various factor-handling R functions (the "methods and model-fitting functions" of the second part of that quote) will then use is.ordered() to test for the presence of that "ordered" class indicator, taking it as a directive to treat an ordered factor differently from an unordered one. Here are a couple of examples:
## The print method for factors. (Type 'print.factor' to see the function's code)
print(f)
# [1] c b a
# Levels: c b a
print(of)
# [1] c b a
# Levels: c < b < a
## The contrasts function. (Type 'contrasts' to see the function's code.)
contrasts(of)
#                 .L         .Q
# [1,] -7.071068e-01  0.4082483
# [2,]  4.350720e-18 -0.8164966
# [3,]  7.071068e-01  0.4082483
contrasts(f)
#   b a
# c 0 0
# b 1 0
# a 0 1
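Comparison operators and summaries such as max() are another place where the "ordered" class matters. Here is a minimal sketch, using a made-up size variable (sz, sz_f and sz_of are illustrative names, not part of the question):

```r
# Hypothetical data: sizes with a natural order small < medium < large
sz    <- c("small", "large", "medium", "small")
lv    <- c("small", "medium", "large")
sz_f  <- factor(sz, levels = lv)    # unordered factor
sz_of <- ordered(sz, levels = lv)   # ordered factor

# Comparisons are meaningful for the ordered factor...
lt_ordered <- sz_of < "large"
lt_ordered
# [1]  TRUE FALSE  TRUE  TRUE

# ...but for the unordered factor they give NA (with a warning)
lt_unordered <- suppressWarnings(sz_f < "large")
lt_unordered
# [1] NA NA NA NA

# max() works on the ordered factor and is an error on the unordered one
top <- as.character(max(sz_of))
top
# [1] "large"
max_unordered <- tryCatch(max(sz_f), error = function(e) "error")
max_unordered
# [1] "error"
```

Both objects use the same levels in the same sequence; only the class differs, which is what these functions check.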
I don't really see a clear question here, so perhaps a simple example would suffice as an answer.
Imagine we have the following data.
set1 <- c("AA", "B", "BA", "CC", "CA", "AA", "BA", "CC", "CC")
We want to factor this data.
f.set1 <- factor(set1)
Let's look at the output. Note that R has just alphabetized the levels, but does not say that this implies hierarchy (see the "levels" line).
f.set1
# [1] AA B BA CC CA AA BA CC CC
# Levels: AA B BA CA CC
is.ordered(f.set1)
# [1] FALSE
However, using as.numeric on the factored data might fool you into thinking it is hierarchical. The numbers are only integer codes giving each value's position in the alphabetized levels, which is why "5" (CC) appears before "4" (CA) in the output below. Note also the alphabetized output of table(f.set1) (which also happens if you simply run table(set1)).
as.numeric(f.set1)
# [1] 1 2 3 5 4 1 3 5 5
table(f.set1)
# f.set1
# AA  B BA CA CC
#  2  1  2  1  3
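To see that these numbers are positions in levels() and not a ranking, you can index the levels with them and get the original strings back. This is only a small check built on the same set1 data; the name codes is introduced here for illustration:

```r
set1   <- c("AA", "B", "BA", "CC", "CA", "AA", "BA", "CC", "CC")
f.set1 <- factor(set1)

# The integer codes stored inside the factor
codes <- as.integer(f.set1)

# Each code is just an index into levels(f.set1)
recovered <- levels(f.set1)[codes]
round_trip <- identical(recovered, set1)
round_trip
# [1] TRUE
```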
Let's now compare this with what happens when we use the ordered argument along with the levels argument. Using levels plus ordered = TRUE tells R that this categorical data is hierarchical, in the order specified by levels (not alphabetically or in the order in which we entered the data).
o.set1 <- factor(set1,
                 levels = c("CA", "BA", "AA", "CC", "B"),
                 ordered = TRUE)
Even viewing the output shows us hierarchy now.
o.set1
# [1] AA B BA CC CA AA BA CC CC
# Levels: CA < BA < AA < CC < B
is.ordered(o.set1)
# [1] TRUE
As do the functions as.numeric and table.
as.numeric(o.set1)
# [1] 3 5 2 4 1 3 2 4 4
table(o.set1)
# o.set1
# CA BA AA CC  B
#  1  2  2  3  1
So, to summarize, factor() by itself just creates a non-hierarchical factor of your categorical data, with the levels sorted alphabetically; factor() with the levels and ordered = TRUE arguments creates hierarchical categories.
Alternatively, use ordered() if you want to create ordered factors directly. The order of the categories still needs to be specified:
ordered(set1, levels = c("CA", "BA", "AA", "CC", "B"))
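For a variable like educ from the question, the choice shows up in how a model formula codes it. The sketch below uses invented education values (the data, the level order and the names educ_f and educ_of are assumptions for illustration) and R's default contrasts option, under which unordered factors get treatment (dummy) coding and ordered factors get polynomial coding:

```r
# Hypothetical education data; the level order is an assumption
educ    <- c("HS", "BA", "MA", "HS", "BA", "PhD")
lv      <- c("HS", "BA", "MA", "PhD")
educ_f  <- factor(educ, levels = lv)                  # unordered
educ_of <- factor(educ, levels = lv, ordered = TRUE)  # ordered

# Unordered: one dummy column per non-reference level
cols_f <- colnames(model.matrix(~ educ_f))
cols_f
# [1] "(Intercept)" "educ_fBA"    "educ_fMA"    "educ_fPhD"

# Ordered: linear, quadratic and cubic trend columns
cols_of <- colnames(model.matrix(~ educ_of))
cols_of
# [1] "(Intercept)" "educ_of.L"   "educ_of.Q"   "educ_of.C"
```

So a plain factor() is enough if you only want separate group effects, while an ordered factor changes how the levels are coded in the model.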