Cleaning up factor levels (collapsing multiple levels/labels)

What is the most effective (i.e. efficient/appropriate) way to clean up a factor containing multiple levels that need to be collapsed? That is, how to combine two or more fa

10 Answers

感情败类

2020-11-22 14:54

You can use the function below to combine/collapse multiple factor levels:

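# Rename every level found in pattern_vector to the element of
# replacement_vector at the same position.
combofactor <- function(pattern_vector,
                        replacement_vector,
                        data) {
  lev <- levels(data)  # avoid shadowing base::levels
  for (i in seq_along(pattern_vector)) {
    lev[which(lev == pattern_vector[i])] <- replacement_vector[i]
  }
  # Assigning duplicated names merges the corresponding levels
  levels(data) <- lev
  data
}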

Example:

Initialize x

x <- factor(c(rep("Y",20),rep("N",20),rep("y",20),
rep("yes",20),rep("Yes",20),rep("No",20)))

Check the structure

str(x)
# Factor w/ 6 levels "N","No","y","Y",..: 4 4 4 4 4 4 4 4 4 4 ...

Use the function:

x_new <- combofactor(c("Y","N","y","yes"),c("Yes","No","Yes","Yes"),x)

Recheck the structure:

str(x_new)
# Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
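
If the loop bothers you, the same renaming can probably be done in one vectorized step with match(). This is only a sketch: the name collapse_levels and its arguments are made up for illustration, and it assumes from and to have the same length.

```r
# Hypothetical vectorized variant of the function above
collapse_levels <- function(f, from, to) {
  lev <- levels(f)
  idx <- match(lev, from)          # position of each level in 'from', NA if absent
  hit <- !is.na(idx)
  lev[hit] <- to[idx[hit]]         # rename only the levels that were matched
  levels(f) <- lev                 # duplicated names are merged into one level
  f
}

x <- factor(c(rep("Y", 20), rep("N", 20), rep("y", 20),
              rep("yes", 20), rep("Yes", 20), rep("No", 20)))
x_new <- collapse_levels(x, c("Y", "N", "y", "yes"), c("Yes", "No", "Yes", "Yes"))

# Quick sanity check on the collapsed counts
table(x_new)
```

Unlike the loop, this applies all replacements at once, so a replacement value is never itself renamed by a later pattern.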


太阳男子

2020-11-22 14:55

Similar to @Aaron's approach, but slightly simpler:

x <- c("Y", "Y", "Yes", "N", "No", "H")
x <- factor(x)
# levels(x)  
# [1] "H"   "N"   "No"  "Y"   "Yes"
# NB: the offending levels are 1, 2, & 4
levels(x)[c(1,2,4)] <- c(NA, "No", "Yes")
x
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes
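
For comparison, @Aaron's answer is not shown on this page. Assuming it refers to the named-list form of `levels<-`, a minimal sketch would look like this (it avoids having to look up level positions by hand):

```r
x <- factor(c("Y", "Y", "Yes", "N", "No", "H"))

# Each list name is a new level; its elements are the old levels to fold into it.
# Old levels that are not mentioned (here "H") end up as NA.
levels(x) <- list(Yes = c("Y", "Yes"), No = c("N", "No"))
x
```

The trade-off is that the index version above breaks silently if the level order changes, while the list version names the old levels explicitly.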


夕颜

2020-11-22 14:57

Another way is to make a table containing the mapping:

# same example vector as in the other answers
x <- c("Y", "Y", "Yes", "N", "No", "H")

# stacking the list from Aaron's answer
fmap = stack(list(Yes = c("Y", "Yes"), No = c("N", "No")))

fmap$ind[ match(x, fmap$values) ]
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes

# or...

library(data.table)
setDT(fmap)[x, on=.(values), ind ]
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes

I prefer this way, since it leaves behind an easily inspected object summarizing the map; and the data.table code looks just like any other join in that syntax.

Of course, if you don't want an object like fmap summarizing the change, it can be a "one-liner":

library(data.table)
setDT(stack(list(Yes = c("Y", "Yes"), No = c("N", "No"))))[x, on=.(values), ind ]
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes
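
One thing the mapping object makes easy is checking what it does not cover. A small base-R sketch, assuming the same x and fmap as above (the names unmapped and collapsed are just illustrative):

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")
fmap <- stack(list(Yes = c("Y", "Yes"), No = c("N", "No")))

# Raw values with no entry in the map; these will become NA after collapsing
unmapped <- setdiff(unique(x), fmap$values)
unmapped

# Cross-tabulate original values against the collapsed ones to audit the map
collapsed <- fmap$ind[match(x, fmap$values)]
table(original = x, collapsed = collapsed, useNA = "ifany")
```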


醉话见心

2020-11-22 15:01
Since R 3.5.0 (2018-04-23) you can do this in one clear and simple line:
```
x = c("Y", "Y", "Yes", "N", "No", "H") # The 'H' should be treated as NA

tmp = factor(x, levels= c("Y", "Yes", "N", "No"), labels= c("Yes", "Yes", "No", "No"))
tmp
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: Yes No
```
In short: "1 line, maps multiple values to the same level, sets NA for missing levels" (h/t @Aaron).
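
If the list of raw values is long, the levels/labels pair can be kept together in one named vector so the two cannot drift out of sync. This is a sketch only; mapping is a hypothetical name, and it relies on the same R >= 3.5.0 behaviour (duplicated labels allowed) as the code above.

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")

# Hypothetical lookup: names are the raw values, elements are the target labels
mapping <- c(Y = "Yes", Yes = "Yes", N = "No", No = "No")

tmp <- factor(x, levels = names(mapping), labels = unname(mapping))

# Values absent from the mapping (here "H") show up in the NA column
table(tmp, useNA = "ifany")
```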