R: factor levels, recode rest to 'other'

夕颜 2021-02-08 03:28

I use factors somewhat infrequently and generally find them comprehensible, but I often am fuzzy about the details for specific operations. Currently, I am coding/collapsing cat

4 Answers
  • 2021-02-08 04:06

    A late entry

    Here is a wrapper for plyr::mapvalues that adds a remaining argument (your 'other'):

    library(plyr)
    
    Mapvalues <- function(x, from, to, warn_missing= TRUE, remaining = NULL){
      if(!is.null(remaining)){
        therest <- setdiff(x, from)
        from <- c(from, therest)
        to <- c(to, rep_len(remaining, length(therest)))
      }
      mapvalues(x, from, to, warn_missing)
    }
    # replace the remaining values with "other"
    Mapvalues(data$naics, top8, top8_desc,remaining = 'other')
    # leave the remaining values alone
    Mapvalues(data$naics, top8, top8_desc)
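
    For readers without plyr, the same "map the known codes, send everything else to one label" idea can be sketched in base R. The codes, descriptions, and variable names below are made up for illustration and are not from the original data:

```r
# Hypothetical toy data: codes and descriptions are illustrative only.
naics <- factor(c("621111", "621210", "621111", "621910", "621493", "621210"))
top <- c("621111", "621210")
top_desc <- c("Physicians", "Dentists")

# match() returns the position of each value in `top`, or NA if it is absent.
idx <- match(as.character(naics), top)

# Values found in `top` get their description; everything else becomes "other".
recoded <- ifelse(is.na(idx), "other", top_desc[idx])
recoded <- factor(recoded, levels = c(top_desc, "other"))

table(recoded)
```

    Note that in this sketch a missing value (NA) in the input would also end up in "other", which may or may not be what you want.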
    
  • 2021-02-08 04:10

    I have written a function to do this, which may be useful to others. It first checks, in relative terms, whether a level occurs in fewer than a fraction mp of the rows (mp = 0.01 means 1%). After that, it limits the number of levels to at most ml.

    ds is the data set at hand, of type data.frame. I do this for all columns listed in cat_var_names, i.e. the factor columns.

    cat_var_names <- names(clean_base[sapply(clean_base, is.factor)])
    
    recodeLevels <- function (ds = clean_base, var_list = cat_var_names, mp = 0.01, ml = 25) {
      # remove less frequent levels in factor
      # 
      n <- nrow(ds)
      # keep levels that cover more than a fraction mp of the cases
      for (i in var_list){
        keep <- levels(ds[[i]])[table(ds[[i]]) > mp * n]
        levels(ds[[i]])[which(!levels(ds[[i]])%in%keep)] <- "other"
      }
    
      # keep top ml levels
      for (i in var_list){
        keep <- head(names(sort(table(ds[[i]]), decreasing = TRUE)), ml)
        levels(ds[[i]])[which(!levels(ds[[i]])%in%keep)] <- "other"
      }
      return(ds)
    }
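
    To see the two rules in action on something small, here is a single-factor sketch. lump_rare is a hypothetical helper written for this illustration, not part of the answer above, and it differs slightly: it measures the share against non-missing values only and applies the ml cap before any "other" level exists.

```r
# Hypothetical helper: collapse rare levels of one factor into `other`.
lump_rare <- function(f, mp = 0.01, ml = 25, other = "other") {
  counts <- sort(table(f), decreasing = TRUE)
  # Rule 1: levels covering more than a fraction mp of the non-missing cases.
  frequent <- names(counts)[as.vector(counts) > mp * sum(counts)]
  # Rule 2: of those, keep at most the ml most frequent.
  keep <- head(frequent, ml)
  # Assigning the same label to several levels merges them.
  levels(f)[!levels(f) %in% keep] <- other
  f
}

# Toy factor with known counts: a = 50, b = 30, c = 15, d = 5.
f <- factor(rep(c("a", "b", "c", "d"), times = c(50, 30, 15, 5)))
res <- lump_rare(f, mp = 0.10, ml = 2)
table(res)
```

    With mp = 0.10 only d falls below the threshold, and ml = 2 then also pushes c into the collapsed level.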
    
  • 2021-02-08 04:11

    You can use forcats::fct_other():

    library(forcats)
    data$naics <- fct_other(data$naics, keep = top8, other_level = 'other')
    

    Or using fct_other() as a part of a dplyr::mutate():

    library(dplyr)
    data <- mutate(data, naics = fct_other(naics, keep = top8, other_level = 'other')) 
    
    data %>% head(10)
       employees  naics
    1        420  other
    2        264  other
    3        189  other
    4        157 621610
    5        376 621610
    6        236  other
    7        658 621320
    8        959 621320
    9        216  other
    10       156  other
    

    Note that if the argument other_level is not set, the other levels default to 'Other' (uppercase 'O').
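
    The snippets in this thread take top8 as given. If you need to derive it from the data, one possible base R sketch is below. The codes are toy values, n_keep is 2 instead of 8 to keep the example short, and the variable names are assumptions made for illustration:

```r
# Toy codes with known frequencies (illustrative, not the original data).
codes <- c(rep("621111", 5), rep("621210", 3), rep("621610", 2), "621910", "621493")
n_keep <- 2  # the thread keeps the top 8; 2 keeps the example readable

# Most frequent codes first, then take the first n_keep names.
top <- head(names(sort(table(codes), decreasing = TRUE)), n_keep)

# Base R counterpart of keeping `top` and collapsing the rest.
f <- factor(codes)
levels(f)[!levels(f) %in% top] <- "other"

table(f)
```

    The resulting top vector is what you would pass as keep = top to fct_other() if you prefer the forcats route.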

    Conversely, if you had only a few levels that you wanted converted to 'other', you could use the drop argument instead:

    data %>%  
      mutate(keep_fct = fct_other(naics, keep = top8, other_level = 'other'),
             drop_fct = fct_other(naics, drop = top8, other_level = 'other')) %>% 
      head(10)
    
       employees  naics keep_fct drop_fct
    1        474 621491    other   621491
    2        805 621111   621111    other
    3        434 621910    other   621910
    4        845 621111   621111    other
    5        243 621340    other   621340
    6        466 621493    other   621493
    7        369 621111   621111    other
    8         57 621493    other   621493
    9        144 621491    other   621491
    10       786 621910    other   621910
    

    dplyr also has recode_factor(), where you can set the .default argument to 'other', but with a larger number of levels to recode, as in this example, that could be tedious:

    data %>% 
       mutate(naics = recode_factor(naics, `621111` = '621111', `621210` = '621210', `621399` = '621399', `621610` = '621610', `621330` = '621330', `621310` = '621310', `621511` = '621511', `621420` = '621420', `621320` = '621320', .default = 'other'))
    
  • 2021-02-08 04:25

    I think the easiest way is to relabel all the naics not in the top 8 to a special value.

    data$naics[!(data$naics %in% top8)] = -99
    

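    Then you can use the exclude option when turning it into a factor, so that -99 is not kept as a level. Note that the excluded entries become NA rather than a level named 'other':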

    factor(data$naics, exclude=-99)
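
    A small sketch of what this does, assuming the codes are stored as numbers (the values are made up). It also shows one possible way to get an explicit 'other' level instead of NA by supplying levels and labels:

```r
# Hypothetical numeric codes; `top` stands in for top8.
naics <- c(621111, 621210, 621910, 621111, 621493)
top <- c(621111, 621210)

# Relabel everything outside `top` with the sentinel value.
x <- naics
x[!(x %in% top)] <- -99

# exclude = -99 removes the sentinel from the levels; those entries become NA.
f <- factor(x, exclude = -99)
levels(f)
sum(is.na(f))

# Alternative: keep the sentinel as a level but label it "other".
g <- factor(x, levels = c(top, -99), labels = c(as.character(top), "other"))
table(g)
```

    Which of the two is preferable depends on whether the remaining rows should be treated as missing or as their own category.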
    