Question
I find manipulating factor variables in R unduly complicated. Things I frequently want to do when cleaning factors include:
- Resorting levels – not just to set a reference category, but also to put all levels in a logical (non-alphabetical) order for summary tables.
    x <- factor(x, levels = new.order)
- Recode / rename factor levels – to simplify names and/or collapse multiple categories into one group. For one-to-one recoding, use
    levels(x) <- new.levels(x)
  or plyr::revalue (see here or here for examples). car::recode can perform several one-to-many matches in a single statement, but doesn't support regex matching.
- Drop levels – not just drop unused levels, but set some levels to missing (e.g. those with error codes).
    x <- factor(as.character(x), exclude = drop.levels)
- Add levels – to show categories with zero counts.
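As a point of reference, the four operations listed above can be sketched in base R as follows. The data and level names are invented for illustration, and this is only one of several ways to do each step.

```r
# Toy data; the values and level names are made up for illustration.
x <- factor(c("low", "high", "medium", "error", "low"))

# 1. Reorder: list the levels in the order they should be displayed.
x_ord <- factor(x, levels = c("low", "medium", "high", "error"))

# 2. Recode / collapse: giving several levels the same name merges them.
x_rec <- x_ord
levels(x_rec)[levels(x_rec) %in% c("medium", "high")] <- "medium+"

# 3. Drop: values listed in `exclude` become NA and lose their level.
#    Going through as.character() resets the level order to alphabetical.
x_drop <- factor(as.character(x_ord), exclude = "error")

# 4. Add: append a level with no observations so it appears in tables.
x_add <- factor(x_ord, levels = c(levels(x_ord), "very high"))
table(x_add)
```

Each step needs a different idiom, which is the inconsistency the question is about.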
What would be great is to have a single function that can do all of the above at once, allows fuzzy (regex) matching for recoding and dropping levels, can be used within other functions (e.g. lapply or dplyr::mutate), and has a simple, consistent syntax.
I’ve posted my best attempt at this as an answer below, but please let me know if I've missed a function that already exists or if the code can be improved.
EDIT
I've been made aware of the forcats package, which is subtitled Tools for Working with Categorical Variables (Factors). The package has many options for resorting levels ('fct_infreq', 'fct_reorder', 'fct_relevel', ...), recoding/grouping levels ('fct_recode', 'fct_lump', 'fct_collapse'), dropping levels ('fct_recode'), and adding levels ('fct_expand'). However, it doesn't, as yet, support regex matching.
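To illustrate what regex-based recoding could look like without any package, here is a minimal base-R sketch. The helper name recode_regex and its patterns argument are hypothetical (they are not part of forcats or any other package); it relies only on grepl and on the fact that assigning duplicate names to levels(x) merges those levels.

```r
# Hypothetical helper, not part of any package: recode levels by regex.
# `patterns` is a named character vector; names are the new levels and
# values are regular expressions matched against the existing levels.
recode_regex <- function(x, patterns) {
  x <- as.factor(x)
  old <- levels(x)
  new <- old
  for (i in seq_along(patterns)) {
    # Match against the original levels so new names are never re-matched.
    new[grepl(patterns[[i]], old)] <- names(patterns)[i]
  }
  levels(x) <- new  # duplicated names collapse into a single level
  x
}

x <- factor(c("dogfish", "rabbit", "catfish", "mouse", "dirt"))
y <- recode_regex(x, c(Sea = "fish", Land = "rab|mou"))
as.character(y)
```

Levels that match no pattern (here "dirt") are left unchanged. If a level matches more than one pattern, the later pattern wins in this sketch.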
Answer 1:
Edit: A few years later I've added the xfactor function on GitHub to accomplish the above. It is still a work in progress, so please let me know if there are any bugs etc.
devtools::install_github("jwilliman/xfactor")
library(xfactor)
# Create example factor
x <- xfactor(c("dogfish", "rabbit","catfish", "mouse", "dirt"))
levels(x)
#> [1] "catfish" "dirt" "dogfish" "mouse" "rabbit"
# Factor levels can be reordered by passing an unnamed vector to the levels
# statement. Levels not included in the replace statement get moved to the end
# or dropped if exclude = TRUE.
xfactor(x, levels = c("mouse", "rabbit"))
#> [1] dogfish rabbit catfish mouse dirt
#> Levels: mouse rabbit catfish dirt dogfish
xfactor(x, levels = c("mouse", "rabbit"), exclude = TRUE)
#> [1] <NA> rabbit <NA> mouse <NA>
#> Levels: mouse rabbit
# Factor levels can be recoded, collapsed, and ordered by passing a named
# vector to the levels statement, where the vector names are the new factor
# levels and the vector values are regular expressions for the old levels.
# Duplicated new levels will be collapsed.
xfactor(x, levels = c("Sea" = "fish", "Land" = "rab|mou"))
#> [1] Sea Land Sea Land dirt
#> Levels: Sea Land dirt
# Factor levels can be dropped by passing a regex expression (or vector) to
# the exclude statement
xfactor(x, exclude = "fish")
#> [1] <NA> rabbit <NA> mouse dirt
#> Levels: dirt mouse rabbit
# The function will work within other functions
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- data.frame(n = 1:5, x)
df %>%
  mutate(y = xfactor(x, levels = c("Sea" = "fish", "Land" = "rab|mou", "Air"), exclude = "di"))
#> n x y
#> 1 1 dogfish Sea
#> 2 2 rabbit Land
#> 3 3 catfish Sea
#> 4 4 mouse Land
#> 5 5 dirt <NA>
Created on 2020-04-16 by the reprex package (v0.3.0)
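For comparison, dropping levels by regex across several columns can be approximated in base R with lapply. This is only a rough sketch: drop_regex is a hypothetical helper, not a function from xfactor, and the data frame is invented for illustration.

```r
# Hypothetical helper: set every level matching `pattern` to NA and drop it.
drop_regex <- function(x, pattern) {
  x <- as.factor(x)
  keep <- levels(x)[!grepl(pattern, levels(x))]
  # Values whose level is not in `keep` become NA.
  factor(x, levels = keep)
}

# Invented example data with two factor columns.
df2 <- data.frame(
  a = factor(c("dogfish", "rabbit", "dirt")),
  b = factor(c("mouse", "catfish", "mouse"))
)

# Apply the same rule to every column; `df2[] <-` keeps the data frame shape.
df2[] <- lapply(df2, drop_regex, pattern = "fish")
df2
```

Because the helper takes the factor as its first argument and returns a factor, it can be used the same way inside dplyr::mutate.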
Source: https://stackoverflow.com/questions/37715937/recode-collapse-and-order-factor-levels-using-a-single-function-with-regex-mat