R: factor levels, recode rest to 'other'

夕颜 2021-02-08 03:28

I use factors somewhat infrequently and generally find them comprehensible, but I often am fuzzy about the details for specific operations. Currently, I am coding/collapsing cat

4 answers
  •  日久生厌
    2021-02-08 04:10

    I have written a function for this that may be useful to others. It first checks, in relative terms, whether a level occurs in less than a fraction mp of the rows (for example, mp = 0.01 means 1%) and recodes such levels to "other". It then caps the number of levels at ml, keeping the most frequent ones.

    ds is the data set at hand, a data.frame. The recoding is applied to every factor column named in cat_var_names.

    # names of all factor columns in the data frame
    cat_var_names <- names(clean_base[sapply(clean_base, is.factor)])

    recodeLevels <- function(ds = clean_base, var_list = cat_var_names, mp = 0.01, ml = 25) {
      # collapse infrequent factor levels into "other"
      n <- nrow(ds)
      # step 1: keep levels that cover more than a fraction mp of the rows
      for (i in var_list) {
        keep <- levels(ds[[i]])[as.vector(table(ds[[i]])) > mp * n]
        levels(ds[[i]])[!levels(ds[[i]]) %in% keep] <- "other"
      }

      # step 2: keep at most the ml most frequent levels
      # head() avoids the NA names that [1:ml] gives when there are fewer than ml levels
      for (i in var_list) {
        keep <- names(head(sort(table(ds[[i]]), decreasing = TRUE), ml))
        levels(ds[[i]])[!levels(ds[[i]]) %in% keep] <- "other"
      }
      return(ds)
    }
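
    To see the two steps on something small, here is a minimal sketch that applies the same logic to a single factor. The helper name collapse_levels and the toy data are made up for illustration and are not part of the function above.

    ```r
    # Hypothetical single-factor helper (illustration only).
    collapse_levels <- function(f, mp = 0.01, ml = 25) {
      stopifnot(is.factor(f))
      # step 1: levels covering at most a fraction mp of the values become "other"
      counts <- table(f)
      keep <- names(counts)[as.vector(counts) > mp * length(f)]
      levels(f)[!levels(f) %in% keep] <- "other"
      # step 2: of the remaining levels, keep only the ml most frequent
      counts <- sort(table(f), decreasing = TRUE)
      keep <- names(head(counts, ml))
      levels(f)[!levels(f) %in% keep] <- "other"
      f
    }

    # toy data: a x 50, b x 30, c x 15, d x 4, e x 1 (100 values)
    x <- factor(rep(c("a", "b", "c", "d", "e"), times = c(50, 30, 15, 4, 1)))
    y <- collapse_levels(x, mp = 0.05, ml = 2)
    table(y)
    ```

    With mp = 0.05, step 1 merges d and e into "other" (5 values). With ml = 2, step 2 keeps only a and b, so c joins "other" as well. Note that "other" itself counts as a level in step 2, so it competes for one of the ml slots.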
    
