Recode, collapse, and order factor levels using a single function with regex matching

Submitted by 三世轮回 on 2020-07-05 13:33:13

Question


I find manipulating factor variables in R unduly complicated. Things I frequently want to do when cleaning factors include:

  • Reordering levels – not just to set a reference category, but also to put all levels in a logical (non-alphabetical) order for summary tables. x <- factor(x, levels = new.order)
  • Recode / rename factor levels – to simplify names and/or collapse multiple categories into one group. For one-to-one recoding, use levels(x) <- new.levels or plyr::revalue. car::recode can perform several many-to-one matches in a single statement, but doesn't support regex matching.

  • Drop levels – not just drop unused levels, but set some levels to missing (e.g. those with error codes). x <- factor(as.character(x), exclude = drop.levels)

  • Add levels – to show categories with zero counts.
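As a rough illustration of the four tasks listed above, here is a minimal base R sketch. The example data and the level names (low, medium, high, error, Unknown) are made up for illustration and are not from the original post; new.order and drop.levels stand in for the placeholder names used in the list.

```r
# Illustrative data (an assumption, not from the original post)
x <- factor(c("low", "high", "medium", "error", "low"))
levels(x)  # alphabetical by default

# 1. Reorder levels into a logical order
new.order <- c("low", "medium", "high", "error")
x1 <- factor(x, levels = new.order)

# 2. Recode / collapse: a named list maps each new level name
#    to the old level(s) it replaces
x2 <- x1
levels(x2) <- list(Low = "low", MediumHigh = c("medium", "high"), Error = "error")

# 3. Drop levels: values listed in drop.levels become NA
#    (note that this re-sorts the remaining levels alphabetically)
drop.levels <- "Error"
x3 <- factor(as.character(x2), exclude = drop.levels)

# 4. Add a level so that a zero count shows up in tables
x4 <- factor(x3, levels = c(levels(x3), "Unknown"))
table(x4)
```

Each step needs a different idiom, and step 3 silently discards the level order set in step 1, which is the kind of inconsistency the question is about.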

What would be great is to have a single function that can do all of the above at once, allows fuzzy (regex) matching for recoding and dropping factors, can be used within other functions (e.g. lapply or dplyr::mutate), and has a simple (consistent) syntax.

I’ve posted my best attempt at this as an answer below, but please let me know if I've missed a function that already exists or if the code can be improved.

EDIT

I've been made aware of the forcats package, which is subtitled Tools for Working with Categorical Variables (Factors). The package has many options for reordering levels ('fct_infreq', 'fct_reorder', 'fct_relevel', ...), recoding/grouping levels ('fct_recode', 'fct_lump', 'fct_collapse'), dropping levels ('fct_recode'), and adding levels ('fct_expand'). However, it doesn't, as yet, support regex matching.
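One possible workaround, sketched here under the assumption that forcats is installed, is to resolve the regular expressions against levels(x) with base R's grep() first and then pass the matched levels to fct_collapse() as ordinary strings. The variable names sea and land are hypothetical; the data mirrors the example used in the answer below.

```r
library(forcats)

x <- factor(c("dogfish", "rabbit", "catfish", "mouse", "dirt"))

# Resolve each regular expression against the current levels first ...
sea  <- grep("fish",    levels(x), value = TRUE)
land <- grep("rab|mou", levels(x), value = TRUE)

# ... then hand the matched levels to fct_collapse() as plain strings
y <- fct_collapse(x, Sea = sea, Land = land)

# Add an empty level so it appears with a zero count in tables
y <- fct_expand(y, "Air")
table(y)
```

This keeps the matching step outside forcats, so it still takes several statements rather than the single call the question asks for.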


Answer 1:


Edit: A few years later, I've added the xfactor function on GitHub to accomplish the above. It is still a work in progress, so please let me know if there are any bugs etc.

devtools::install_github("jwilliman/xfactor")

library(xfactor)

# Create example factor
x <- xfactor(c("dogfish", "rabbit","catfish", "mouse", "dirt"))
levels(x)
#> [1] "catfish" "dirt"    "dogfish" "mouse"   "rabbit"

# Factor levels can be reordered by passing an unnamed vector to the levels
# argument. Levels not included in that vector are moved to the end,
# or dropped if exclude = TRUE.
xfactor(x, levels = c("mouse", "rabbit"))
#> [1] dogfish rabbit  catfish mouse   dirt   
#> Levels: mouse rabbit catfish dirt dogfish

xfactor(x, levels = c("mouse", "rabbit"), exclude = TRUE)
#> [1] <NA>   rabbit <NA>   mouse  <NA>  
#> Levels: mouse rabbit

# Factor levels can be recoded, collapsed, and ordered by passing a named
# vector to the levels argument. The vector names are the new factor
# levels and the vector values are regular expressions matching the old levels.
# Duplicated new levels will be collapsed.

xfactor(x, levels = c("Sea" = "fish", "Land" = "rab|mou"))
#> [1] Sea  Land Sea  Land dirt
#> Levels: Sea Land dirt

# Factor levels can be dropped by passing a regular expression (or vector) to
# the exclude argument

xfactor(x, exclude = "fish")
#> [1] <NA>   rabbit <NA>   mouse  dirt  
#> Levels: dirt mouse rabbit

# The function will work within other functions

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- data.frame(n = 1:5, x)
df %>%
  mutate(y = xfactor(x, levels = c("Sea" = "fish", "Land" = "rab|mou", "Air"), exclude = "di"))
#>   n       x    y
#> 1 1 dogfish  Sea
#> 2 2  rabbit Land
#> 3 3 catfish  Sea
#> 4 4   mouse Land
#> 5 5    dirt <NA>

Created on 2020-04-16 by the reprex package (v0.3.0)
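For readers who want to see how regex-based recoding can work in principle, here is a minimal base R sketch. It is not the xfactor implementation: recode_regex is a hypothetical helper written for illustration, it assumes exclude is a single regular expression, and it covers only named patterns.

```r
# Hypothetical helper, NOT part of xfactor: recode factor levels by regex.
# `patterns` is a named character vector: names are the new levels,
# values are regular expressions matched against the old levels.
recode_regex <- function(x, patterns, exclude = NULL) {
  x <- as.factor(x)
  old <- levels(x)
  new <- old                          # unmatched levels keep their names
  for (i in seq_along(patterns)) {
    hit <- grepl(patterns[[i]], old)  # if several patterns match, the last wins
    new[hit] <- names(patterns)[i]
  }
  # Optionally set levels matching `exclude` (a single regex) to NA
  if (!is.null(exclude)) {
    new[grepl(exclude, old)] <- NA
  }
  # New levels first, in the order given, then the untouched old levels
  lev <- unique(c(names(patterns), new[!is.na(new)]))
  factor(new[as.integer(x)], levels = lev)
}

x <- c("dogfish", "rabbit", "catfish", "mouse", "dirt")

# Values: Sea Land Sea Land dirt; levels: Sea Land dirt
recode_regex(x, c(Sea = "fish", Land = "rab|mou"))

# "dirt" becomes NA; levels: Sea Land
recode_regex(x, c(Sea = "fish", Land = "rab|mou"), exclude = "di")
```

The sketch works on the levels rather than on every element, so each pattern is matched only once per distinct level, and the mapping is then applied through the factor's integer codes.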



Source: https://stackoverflow.com/questions/37715937/recode-collapse-and-order-factor-levels-using-a-single-function-with-regex-mat
