R: factor levels, recode rest to 'other'

家住魔仙堡 提交于 2019-11-29 01:53:35

I think the easiest way is to relabel all the naics not in the top 8 to a special value.

data$naics[!(data$naics %in% top8)] = -99

Then you can use the "exclude" option when turning it into a factor

factor(data$naics, exclude=-99)

You can use forcats::fct_other():

library(forcats)
data$naics <- fct_other(data$naics, keep = top8, other_level = 'other')

Or using fct_other() as a part of a dplyr::mutate():

library(dplyr)
data <- mutate(data, naics = fct_other(naics, keep = top8, other_level = 'other')) 

data %>% head(10)
   employees  naics
1        420  other
2        264  other
3        189  other
4        157 621610
5        376 621610
6        236  other
7        658 621320
8        959 621320
9        216  other
10       156  other

Note that if the argument other_level is not set, the other levels default to 'Other' (uppercase 'O').

And conversely, if you had a only a few factors you wanted converted to 'other', you could use the argument drop instead:

data %>%  
  mutate(keep_fct = fct_other(naics, keep = top8, other_level = 'other'),
         drop_fct = fct_other(naics, drop = top8, other_level = 'other')) %>% 
  head(10)

   employees  naics keep_fct drop_fct
1        474 621491    other   621491
2        805 621111   621111    other
3        434 621910    other   621910
4        845 621111   621111    other
5        243 621340    other   621340
6        466 621493    other   621493
7        369 621111   621111    other
8         57 621493    other   621493
9        144 621491    other   621491
10       786 621910    other   621910

dpylr also has recode_factor() where you can set the .default argument to other, but with a larger number of levels to recode, like with this example, could be tedious:

data %>% 
   mutate(naices = recode_factor(naics, `621111` = '621111', `621210` = '621210', `621399` = '621399', `621610` = '621610', `621330` = '621330', `621310` = '621310', `621511` = '621511', `621420` = '621420', `621320` = '621320', .default = 'other'))

A late entry

Here is a wrapper for plyr::mapvalues which allows the a remaining argument (your other)

library(plyr)

Mapvalues <- function(x, from, to, warn_missing= TRUE, remaining = NULL){
  if(!is.null(remaining)){
    therest <- setdiff(x, from)
    from <- c(from, therest)
    to <- c(to, rep_len(remaining, length(therest)))
  }
  mapvalues(x, from, to, warn_missing)
}
# replace the remaining values with "other"
Mapvalues(data$naics, top8, top8_desc,remaining = 'other')
# leave the remaining values alone
Mapvalues(data$naics, top8, top8_desc)

I have writen a function to do this that can be usefull to others may be? I first check in a relative manner, if a level occures less then mp percent of the base. After that I check to limit the max number of levels to be ml.

ds is the data set at hand of type data.frame, I do this for all columns that appear in cat_var_names as factors.

cat_var_names <- names(clean_base[sapply(clean_base, is.factor)])

recodeLevels <- function (ds = clean_base, var_list = cat_var_names, mp = 0.01, ml = 25) {
  # remove less frequent levels in factor
  # 
  n <- nrow(ds)
  # keep levels with more then mp percent of cases
  for (i in var_list){
    keep <- levels(ds[[i]])[table(ds[[i]]) > mp * n]
    levels(ds[[i]])[which(!levels(ds[[i]])%in%keep)] <- "other"
  }

  # keep top ml levels
  for (i in var_list){
    keep <- names(sort(table(ds[i]),decreasing=TRUE)[1:ml])
    levels(ds[[i]])[which(!levels(ds[[i]])%in%keep)] <- "other"
  }
  return(ds)
}
标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!