What is the most effective (ie efficient / appropriate) way to clean up a factor containing multiple levels that need to be collapsed? That is, how to combine two or more fa
You may use the below function for combining/collapsing multiple factors:
combofactor <- function(pattern_vector,
replacement_vector,
data) {
levels <- levels(data)
for (i in 1:length(pattern_vector))
levels[which(pattern_vector[i] == levels)] <-
replacement_vector[i]
levels(data) <- levels
data
}
Example:
Initialize x
x <- factor(c(rep("Y",20),rep("N",20),rep("y",20),
rep("yes",20),rep("Yes",20),rep("No",20)))
Check the structure
str(x)
# Factor w/ 6 levels "N","No","y","Y",..: 4 4 4 4 4 4 4 4 4 4 ...
Use the function:
x_new <- combofactor(c("Y","N","y","yes"),c("Yes","No","Yes","Yes"),x)
Recheck the structure:
str(x_new)
# Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
Similar to @Aaron's approach, but slightly simpler would be:
x <- c("Y", "Y", "Yes", "N", "No", "H")
x <- factor(x)
# levels(x)
# [1] "H" "N" "No" "Y" "Yes"
# NB: the offending levels are 1, 2, & 4
levels(x)[c(1,2,4)] <- c(NA, "No", "Yes")
x
# [1] Yes Yes Yes No No <NA>
# Levels: No Yes
Another way is to make a table containing the mapping:
# stacking the list from Aaron's answer
fmap = stack(list(Yes = c("Y", "Yes"), No = c("N", "No")))
fmap$ind[ match(x, fmap$values) ]
# [1] Yes Yes Yes No No <NA>
# Levels: No Yes
# or...
library(data.table)
setDT(fmap)[x, on=.(values), ind ]
# [1] Yes Yes Yes No No <NA>
# Levels: No Yes
I prefer this way, since it leaves behind an easily inspected object summarizing the map; and the data.table code looks just like any other join in that syntax.
Of course, if you don't want an object like fmap
summarizing the change, it can be a "one-liner":
library(data.table)
setDT(stack(list(Yes = c("Y", "Yes"), No = c("N", "No"))))[x, on=.(values), ind ]
# [1] Yes Yes Yes No No <NA>
# Levels: No Yes
Since R 3.5.0 (2018-04-23) you can do this in one clear and simple line:
x = c("Y", "Y", "Yes", "N", "No", "H") # The 'H' should be treated as NA
tmp = factor(x, levels= c("Y", "Yes", "N", "No"), labels= c("Yes", "Yes", "No", "No"))
tmp
# [1] Yes Yes Yes No No <NA>
# Levels: Yes No
1 line, maps multiple values to the same level, sets NA for missing levels" – h/t @Aaron