I use factors somewhat infrequently and generally find them comprehensible, but I often am fuzzy about the details for specific operations. Currently, I am coding/collapsing cat
You can use forcats::fct_other():
library(forcats)
data$naics <- fct_other(data$naics, keep = top8, other_level = 'other')
Or use fct_other() as part of a dplyr::mutate() call:
library(dplyr)
data <- mutate(data, naics = fct_other(naics, keep = top8, other_level = 'other'))
data %>% head(10)
   employees  naics
1        420  other
2        264  other
3        189  other
4        157 621610
5        376 621610
6        236  other
7        658 621320
8        959 621320
9        216  other
10       156  other
Note that if the other_level argument is not set, the collapsed levels are labelled 'Other' (uppercase 'O') by default.
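For anyone who wants to try this without the original data, here is a minimal sketch on made-up values. The factor naics_toy and the vector top2 are hypothetical stand-ins for data$naics and top8, and taking the most frequent codes from table() is only an assumption about how top8 was built:

```r
library(forcats)

# Hypothetical toy factor standing in for data$naics
naics_toy <- factor(c("621111", "621111", "621111",
                      "621610", "621610", "621320", "621491"))

# Assumed construction of the "top" codes: the two most frequent levels
# (the question keeps eight)
top2 <- names(sort(table(naics_toy), decreasing = TRUE))[1:2]

# Explicit label for the collapsed levels
custom_other <- fct_other(naics_toy, keep = top2, other_level = "other")

# No other_level given, so the default label 'Other' is used
default_other <- fct_other(naics_toy, keep = top2)

levels(custom_other)
levels(default_other)
```

In both results the collapsed level is placed last, after the kept codes.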
Conversely, if you had only a few levels you wanted converted to 'other', you could use the drop argument instead:
data %>%
  mutate(keep_fct = fct_other(naics, keep = top8, other_level = 'other'),
         drop_fct = fct_other(naics, drop = top8, other_level = 'other')) %>%
  head(10)
   employees  naics keep_fct drop_fct
1        474 621491    other   621491
2        805 621111   621111    other
3        434 621910    other   621910
4        845 621111   621111    other
5        243 621340    other   621340
6        466 621493    other   621493
7        369 621111   621111    other
8         57 621493    other   621493
9        144 621491    other   621491
10       786 621910    other   621910
dplyr also has recode_factor(), where you can set the .default argument to 'other', but with a larger number of levels to recode, as in this example, that could be tedious:
data %>%
  mutate(naics = recode_factor(naics,
    `621111` = '621111', `621210` = '621210', `621399` = '621399',
    `621610` = '621610', `621330` = '621330', `621310` = '621310',
    `621511` = '621511', `621420` = '621420', `621320` = '621320',
    .default = 'other'))
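If typing out every pair is the tedious part, one possible shortcut is to build the pairs from a character vector and splice them in with !!!, which recode_factor() accepts through its dots. This is only a sketch on made-up data: naics_toy and keep_codes are hypothetical stand-ins for data$naics and top8.

```r
library(dplyr)

# Hypothetical toy factor standing in for data$naics
naics_toy <- factor(c("621111", "621610", "621320", "621491", "621111"))

# Hypothetical stand-in for top8: the codes to keep unchanged
keep_codes <- c("621111", "621610")

# setNames() builds a named vector mapping each kept code to itself;
# !!! splices it in as `621111` = "621111", `621610` = "621610"
recoded <- recode_factor(naics_toy,
                         !!!setNames(keep_codes, keep_codes),
                         .default = "other")

levels(recoded)
as.character(recoded)
```

The levels come out in the order of the spliced codes, followed by the .default label. For this particular task fct_other() is still the shorter route.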