I use factors somewhat infrequently and generally find them comprehensible, but I often am fuzzy about the details for specific operations. Currently, I am coding/collapsing cat
A late entry: here is a wrapper for plyr::mapvalues that adds a remaining argument (your 'other').
library(plyr)
Mapvalues <- function(x, from, to, warn_missing = TRUE, remaining = NULL) {
  if (!is.null(remaining)) {
    # every value of x not listed in `from` is mapped to `remaining`
    therest <- setdiff(x, from)
    from <- c(from, therest)
    to <- c(to, rep_len(remaining, length(therest)))
  }
  mapvalues(x, from, to, warn_missing)
}
# replace the remaining values with "other"
Mapvalues(data$naics, top8, top8_desc,remaining = 'other')
# leave the remaining values alone
Mapvalues(data$naics, top8, top8_desc)
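For readers without plyr, the same idea can be sketched in base R with a named lookup vector. This is only an illustration: the codes in x and the labels "label_A" and "label_B" are made-up stand-ins for data$naics and top8_desc, and the sketch treats any value missing from the lookup (including an NA already in x) as 'other'.

```r
# Toy stand-ins (hypothetical values, not from the original data)
x    <- factor(c("621111", "621210", "621910", "621111"))
from <- c("621111", "621210")
to   <- c("label_A", "label_B")

# Named vector: names are the old values, elements are the new ones
lookup <- setNames(to, from)

# as.character() so a factor is looked up by label, not by integer code
mapped <- unname(lookup[as.character(x)])

# Values absent from `from` come back as NA; send them to 'other'
mapped[is.na(mapped)] <- "other"
mapped
```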
I have written a function for this that may be useful to others. It first applies a relative threshold: any level that accounts for no more than a share mp of the rows is collapsed into "other". It then caps the number of levels at ml by keeping only the ml most frequent ones.
ds is the data set at hand, of type data.frame; the recoding is applied to every factor column listed in cat_var_names.
cat_var_names <- names(clean_base[sapply(clean_base, is.factor)])
recodeLevels <- function(ds = clean_base, var_list = cat_var_names, mp = 0.01, ml = 25) {
  # collapse infrequent factor levels into "other"
  n <- nrow(ds)
  # keep levels that cover more than a share mp of the rows
  for (i in var_list) {
    keep <- levels(ds[[i]])[table(ds[[i]]) > mp * n]
    levels(ds[[i]])[!(levels(ds[[i]]) %in% keep)] <- "other"
  }
  # keep the ml most frequent levels ("other" itself counts as a level here)
  for (i in var_list) {
    keep <- head(names(sort(table(ds[[i]]), decreasing = TRUE)), ml)
    levels(ds[[i]])[!(levels(ds[[i]]) %in% keep)] <- "other"
  }
  return(ds)
}
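To see what the two rules do, here is a small sketch on a single toy factor. The helper collapse_rare and the data are hypothetical, written only for illustration; they mirror the logic above rather than reproduce the function itself.

```r
# Hypothetical single-factor version of the two rules above
collapse_rare <- function(f, mp = 0.01, ml = 25, other = "other") {
  # Rule 1: keep levels covering more than a share mp of the observations
  counts <- table(f)
  keep <- names(counts)[counts > mp * length(f)]
  levels(f)[!(levels(f) %in% keep)] <- other
  # Rule 2: of what is left, keep only the ml most frequent levels
  keep <- head(names(sort(table(f), decreasing = TRUE)), ml)
  levels(f)[!(levels(f) %in% keep)] <- other
  f
}

# Toy data: 100 observations over five levels
f <- factor(c(rep("a", 50), rep("b", 30), rep("c", 15), rep("d", 4), "e"))
res <- collapse_rare(f, mp = 0.02, ml = 2)
table(res)
```

With mp = 0.02 only "e" falls under the relative threshold; the cap ml = 2 then folds "c" and "d" into "other" as well, leaving the levels a, b and other.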
You can use forcats::fct_other():
library(forcats)
data$naics <- fct_other(data$naics, keep = top8, other_level = 'other')
Or use fct_other() as part of a dplyr::mutate():
library(dplyr)
data <- mutate(data, naics = fct_other(naics, keep = top8, other_level = 'other'))
data %>% head(10)
   employees  naics
1        420  other
2        264  other
3        189  other
4        157 621610
5        376 621610
6        236  other
7        658 621320
8        959 621320
9        216  other
10       156  other
Note that if the argument other_level is not set, the other levels default to 'Other' (uppercase 'O').
Conversely, if there are only a few levels you want converted to 'other', you can use the argument drop instead:
data %>%
  mutate(keep_fct = fct_other(naics, keep = top8, other_level = 'other'),
         drop_fct = fct_other(naics, drop = top8, other_level = 'other')) %>%
  head(10)
   employees  naics keep_fct drop_fct
1        474 621491    other   621491
2        805 621111   621111    other
3        434 621910    other   621910
4        845 621111   621111    other
5        243 621340    other   621340
6        466 621493    other   621493
7        369 621111   621111    other
8         57 621493    other   621493
9        144 621491    other   621491
10       786 621910    other   621910
dplyr also has recode_factor(), where you can set the .default argument to 'other'. With a larger number of levels to recode, as in this example, that could be tedious:
data %>%
  mutate(naics = recode_factor(naics,
    `621111` = '621111', `621210` = '621210', `621399` = '621399',
    `621610` = '621610', `621330` = '621330', `621310` = '621310',
    `621511` = '621511', `621420` = '621420', `621320` = '621320',
    .default = 'other'))
I think the easiest way is to relabel all the naics not in the top 8 to a special value.
data$naics[!(data$naics %in% top8)] = -99
Then you can use the "exclude" option when turning it into a factor; the excluded values become NA rather than a level of their own:
factor(data$naics, exclude=-99)
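If an explicit 'other' level is preferred over NA, a minimal base R sketch along the same lines is to rename the unwanted levels directly. The toy vector naics and the two-element top below are hypothetical stand-ins for data$naics and top8.

```r
# Toy stand-ins (hypothetical values, not from the original data)
naics <- factor(c("621111", "621210", "621111", "621910", "621493", "621210"))
top   <- c("621111", "621210")

collapsed <- naics
# Assigning the same label to several levels merges them into one level
levels(collapsed)[!(levels(collapsed) %in% top)] <- "other"

table(collapsed)
```

This leaves the original naics untouched and works on the levels rather than on the individual values, so it only applies when the column is already a factor.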