Cleaning up factor levels (collapsing multiple levels/labels)


What is the most effective (i.e. efficient / appropriate) way to clean up a factor containing multiple levels that need to be collapsed? That is, how to combine two or more fa

10 answers

心在旅途

2020-11-22 14:37
UPDATE 2: See Uwe's answer which shows the new "tidyverse" way of doing this, which is quickly becoming the standard.

UPDATE 1: Duplicated labels (but not levels!) are now indeed allowed (per my comment above); see Tim's answer.

ORIGINAL ANSWER, BUT STILL USEFUL AND OF INTEREST: There is a little-known option to assign a named list with the levels<- replacement function, for exactly this purpose. The names of the list are the desired new level names, and each element is a character vector of the current levels to be merged under that name. Levels that appear in no element (such as "H" below) become NA. Some (including the OP, see Ricardo's comment to Tim's answer) prefer this for ease of reading.
```
x <- c("Y", "Y", "Yes", "N", "No", "H", NA)
x <- factor(x)
levels(x) <- list("Yes"=c("Y", "Yes"), "No"=c("N", "No"))
x
## [1] Yes  Yes  Yes  No   No   <NA>  <NA>
## Levels: Yes No
```
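One caveat with the named-list form is that any level not mentioned in the list becomes NA (as "H" does above). If the remaining levels should be kept instead, a small wrapper can append them to the list first. This is only a sketch: collapse_levels is a hypothetical helper name, not a base R function.

```r
# Hypothetical helper: collapse the listed levels, keep all other levels unchanged
collapse_levels <- function(f, groups) {
  f <- factor(f)
  # levels not mentioned in `groups` would otherwise be turned into NA
  keep <- setdiff(levels(f), unlist(groups))
  levels(f) <- c(groups, setNames(as.list(keep), keep))
  f
}

x <- factor(c("Y", "Y", "Yes", "N", "No", "H", NA))
res <- collapse_levels(x, list(Yes = c("Y", "Yes"), No = c("N", "No")))
levels(res)
## [1] "Yes" "No"  "H"
```

Genuine NA values in x stay NA either way; only the unlisted level "H" is affected.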
This is described in the documentation for levels (see ?levels and the examples there):

value: For the 'factor' method, a vector of character strings with length at least the number of levels of 'x', or a named list specifying how to rename the levels.

This can also be done in one line by calling the replacement function levels<- directly, as Marek does here: https://stackoverflow.com/a/10432263/210673; how the levels<- sorcery works is explained here: https://stackoverflow.com/a/10491881/210673.
```
x <- c("Y", "Y", "Yes", "N", "No", "H")  # character vector as in the question
`levels<-`(factor(x), list(Yes=c("Y", "Yes"), No=c("N", "No")))
## [1] Yes  Yes  Yes  No   No   <NA>
## Levels: Yes No
```
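To make the one-liner less mysterious: levels(f) <- value is roughly equivalent to f <- `levels<-`(f, value), so calling the replacement function directly just returns a modified copy and leaves its input alone. A minimal sketch (f and g are illustrative names only):

```r
f <- factor(c("Y", "Yes", "N"))
# call the replacement function like any ordinary function
g <- `levels<-`(f, list(Yes = c("Y", "Yes"), No = "N"))
levels(f)  # the original factor is untouched
## [1] "N"   "Y"   "Yes"
levels(g)
## [1] "Yes" "No"
```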

春和景丽

2020-11-22 14:38
I am adding this answer to show the accepted answer applied to a specific factor column in a data frame, since this was not initially obvious to me (though it probably should have been).
```
levels(df$var1)
# "0" "1" "Z"
summary(df$var1)
#    0    1    Z 
# 7012 2507    8 
levels(df$var1) <- list("0"=c("Z", "0"), "1"=c("1"))
levels(df$var1)
# "0" "1"
summary(df$var1)
#    0    1 
# 7020 2507
```
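The df above is not defined in the answer, so here are the same steps on a small made-up data frame that can be run as-is. The data and counts are invented for illustration and are not the ones shown above.

```r
df <- data.frame(var1 = factor(c("0", "1", "Z", "0", "1", "0")))
before <- table(df$var1)   # counts per level: 0 = 3, 1 = 2, Z = 1
# fold "Z" into "0"; "1" stays as it is
levels(df$var1) <- list("0" = c("Z", "0"), "1" = "1")
after <- table(df$var1)    # counts per level: 0 = 4, 1 = 2
```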

萌比男神i

2020-11-22 14:42
Perhaps a named vector as a key might be of use:
```
> factor(unname(c(Y = "Yes", Yes = "Yes", N = "No", No = "No", H = NA)[x]))
[1] Yes  Yes  Yes  No   No   <NA>
Levels: No Yes
```
This looks very similar to your last attempt... but this one works :-)
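One thing to watch: the lookup relies on x being a character vector. If x is already a factor, indexing the key with it would use the factor's integer codes rather than its labels, so it is safer to convert first. A sketch, with key as an illustrative name and the level order set explicitly:

```r
key <- c(Y = "Yes", Yes = "Yes", N = "No", No = "No")
x <- factor(c("Y", "Y", "Yes", "N", "No", "H"))
# as.character() makes the lookup go by label; "H" is not in the key and becomes NA
res <- factor(unname(key[as.character(x)]), levels = c("Yes", "No"))
res
## [1] Yes  Yes  Yes  No   No   <NA>
## Levels: Yes No
```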

走了就别回头了

2020-11-22 14:46
As the question is titled "Cleaning up factor levels (collapsing multiple levels/labels)", the forcats package should be mentioned here as well, for the sake of completeness. forcats appeared on CRAN in August 2016.

There are several convenience functions available for cleaning up factor levels:
```
x <- c("Y", "Y", "Yes", "N", "No", "H") 

library(forcats)
```
Collapse factor levels into manually defined groups
```
fct_collapse(x, Yes = c("Y", "Yes"), No = c("N", "No"), NULL = "H")
#[1] Yes  Yes  Yes  No   No   <NA>
#Levels: No Yes
```
Change factor levels by hand
```
fct_recode(x, Yes = "Y", Yes = "Yes", No = "N", No = "No", NULL = "H")
#[1] Yes  Yes  Yes  No   No   <NA>
#Levels: No Yes
```
Automatically relabel factor levels, collapse as necessary
```
fun <- function(z) {
  z[z == "Y"] <- "Yes"
  z[z == "N"] <- "No"
  z[!(z %in% c("Yes", "No"))] <- NA
  z
}
fct_relabel(factor(x), fun)
#[1] Yes  Yes  Yes  No   No   <NA>
#Levels: No Yes
```
Note that fct_relabel() works on the factor levels, so it expects a factor as its first argument. The other two functions, fct_collapse() and fct_recode(), also accept a character vector, which is an undocumented feature.

Reorder factor levels by first appearance

The expected output given by the OP is
```
[1] Yes  Yes  Yes  No   No   <NA>
Levels: Yes No
```
Here the levels are ordered as they first appear in x, which differs from the default (?factor: "The levels of a factor are by default sorted").

To match the expected output, apply fct_inorder() before collapsing the levels:
```
fct_collapse(fct_inorder(x), Yes = c("Y", "Yes"), No = c("N", "No"), NULL = "H")
fct_recode(fct_inorder(x), Yes = "Y", Yes = "Yes", No = "N", No = "No", NULL = "H")
```
Both now return the expected output, with the levels in the same order.
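For comparison, a rough base R counterpart of the fct_inorder() step is factor(x, levels = unique(x)). The sketch below is not how forcats implements it, just an equivalent under the assumption that x is a character vector. The merging is done by assigning duplicated level names.

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")
f <- factor(x, levels = unique(x))   # levels in order of first appearance: Y Yes N No H
# duplicated names merge levels; NA drops the level "H"
levels(f) <- c("Yes", "Yes", "No", "No", NA)
levels(f)
## [1] "Yes" "No"
```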


庸人自扰

2020-11-22 14:48

First let's note that in this specific case we can use partial matching:

```
x <- c("Y", "Y", "Yes", "N", "No", "H")
y <- c("Yes","No")
x <- factor(y[pmatch(x,y,duplicates.ok = TRUE)])
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes
```
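Partial matching only works while each abbreviation identifies exactly one target. As a sketch of the failure mode (the extra label "None" is invented for illustration), an ambiguous prefix gives NA, while an exact match still wins:

```r
y <- c("Yes", "No", "None")
idx <- pmatch(c("Y", "N", "No"), y, duplicates.ok = TRUE)
# "Y" matches only "Yes"; "N" could be "No" or "None"; "No" matches exactly
idx
## [1]  1 NA  2
```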

In a more general case I'd go with dplyr::recode:

```
library(dplyr)
x <- c("Y", "Y", "Yes", "N", "No", "H")
y <- c(Y="Yes",N="No")
x <- recode(x,!!!y)
x <- factor(x,y)
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: Yes No
```

Slightly altered if the starting point is a factor:

```
x <- factor(c("Y", "Y", "Yes", "N", "No", "H"))
y <- c(Y="Yes",N="No")
x <- recode_factor(x,!!!y)
x <- factor(x,y)
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: Yes No
```


走了就别回头了

2020-11-22 14:54
I don't know your real use-case, but would strtrim be of any use here?
```
factor( strtrim( x , 1 ) , levels = c("Y" , "N" ) , labels = c("Yes" , "No" ) )
#[1] Yes  Yes  Yes  No   No   <NA>
#Levels: Yes No
```