Cleaning up factor levels (collapsing multiple levels/labels)

礼貌的吻别 2020-11-22 14:27

What is the most effective (i.e., efficient and appropriate) way to clean up a factor containing multiple levels that need to be collapsed? That is, how to combine two or more fa

10 Answers
  • 2020-11-22 14:54

    You can use the function below to combine (collapse) multiple levels of a factor:

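    combofactor <- function(pattern_vector,
                            replacement_vector,
                            data) {
      # Work on a copy of the level labels; the integer codes are untouched
      lvls <- levels(data)
      # Replace each level that exactly matches a pattern with its replacement
      for (i in seq_along(pattern_vector)) {
        lvls[lvls == pattern_vector[i]] <- replacement_vector[i]
      }
      # Assigning duplicated labels merges the corresponding levels
      levels(data) <- lvls
      data
    }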
    

    Example:

    Initialize x

    x <- factor(c(rep("Y",20),rep("N",20),rep("y",20),
    rep("yes",20),rep("Yes",20),rep("No",20)))
    

    Check the structure

    str(x)
    # Factor w/ 6 levels "N","No","y","Y",..: 4 4 4 4 4 4 4 4 4 4 ...
    

    Use the function:

    x_new <- combofactor(c("Y","N","y","yes"),c("Yes","No","Yes","Yes"),x)
    

    Recheck the structure:

    str(x_new)
    # Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
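
    The same relabelling can also be written without a loop. The following is only a sketch: `collapse_levels` and its `map` argument are hypothetical names (not part of the answer above), and `map` is assumed to be a named character vector whose names are the old levels and whose values are the new ones.

```r
# Hypothetical helper: names(map) are old levels, values are the new labels.
collapse_levels <- function(f, map) {
  lvls <- levels(f)
  hit <- lvls %in% names(map)            # levels that have a replacement
  lvls[hit] <- unname(map[lvls[hit]])    # look up the replacements by name
  levels(f) <- lvls                      # duplicated labels are merged
  f
}

x <- factor(c("Y", "N", "y", "yes", "Yes", "No"))
x_new <- collapse_levels(x, c(Y = "Yes", N = "No", y = "Yes", yes = "Yes"))
levels(x_new)
```

    Levels that are not named in `map` (here "Yes" and "No") are left as they are.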
    
  • 2020-11-22 14:55

    Similar to @Aaron's approach, but slightly simpler would be:

    x <- c("Y", "Y", "Yes", "N", "No", "H")
    x <- factor(x)
    # levels(x)  
    # [1] "H"   "N"   "No"  "Y"   "Yes"
    # NB: the offending levels are 1, 2, & 4
    levels(x)[c(1,2,4)] <- c(NA, "No", "Yes")
    x
    # [1] Yes  Yes  Yes  No   No   <NA>
    # Levels: No Yes
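
    The positional assignment above relies on knowing where each level falls in the sorted level vector, and that order can differ between locales. As a sketch of a name-based alternative, base R's `levels<-` also accepts a named list in which each name is a new level and each element lists the old levels to fold into it; old levels not mentioned in the list become NA.

```r
x <- factor(c("Y", "Y", "Yes", "N", "No", "H"))

# Each list name is a new level; its element lists the old levels it absorbs.
# "H" is not mentioned anywhere, so it becomes NA.
levels(x) <- list(Yes = c("Y", "Yes"), No = c("N", "No"))
x
```

    With this form the resulting levels follow the order of the list names rather than the sorted order.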
    
  • 2020-11-22 14:57

    Another way is to make a table containing the mapping:

    # same input vector as in the answers above
    x <- c("Y", "Y", "Yes", "N", "No", "H")
    # stacking the list from Aaron's answer
    fmap = stack(list(Yes = c("Y", "Yes"), No = c("N", "No")))
    
    fmap$ind[ match(x, fmap$values) ]
    # [1] Yes  Yes  Yes  No   No   <NA>
    # Levels: No Yes
    
    # or...
    
    library(data.table)
    setDT(fmap)[x, on=.(values), ind ]
    # [1] Yes  Yes  Yes  No   No   <NA>
    # Levels: No Yes
    

    I prefer this way, since it leaves behind an easily inspected object summarizing the map, and the data.table code looks just like any other join in that syntax.


    Of course, if you don't want an object like fmap summarizing the change, it can be a "one-liner":

    library(data.table)
    setDT(stack(list(Yes = c("Y", "Yes"), No = c("N", "No"))))[x, on=.(values), ind ]
    # [1] Yes  Yes  Yes  No   No   <NA>
    # Levels: No Yes
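
    One benefit of keeping the map as a table is that it can be used to audit the input before recoding. A minimal base R sketch, assuming the same `x` and `fmap` as above (the names `idx`, `unmapped`, and `recoded` are illustrative only):

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")
fmap <- stack(list(Yes = c("Y", "Yes"), No = c("N", "No")))

# Position of each input value in the map; NA means "no mapping defined"
idx <- match(x, fmap$values)

# Values that would silently turn into NA after recoding
unmapped <- unique(x[is.na(idx)])
unmapped

# The recoded factor, as in the match() version above
recoded <- fmap$ind[idx]
```

    Checking `unmapped` first makes it easier to tell deliberate NAs (such as "H" here) from typos in the map.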
    
  • 2020-11-22 15:01

    Since R 3.5.0 (2018-04-23) you can do this in one clear and simple line:

    x = c("Y", "Y", "Yes", "N", "No", "H") # The 'H' should be treated as NA
    
    tmp = factor(x, levels= c("Y", "Yes", "N", "No"), labels= c("Yes", "Yes", "No", "No"))
    tmp
    # [1] Yes  Yes  Yes  No   No   <NA>
    # Levels: Yes No
    

    This takes one line, maps multiple values to the same level, and sets NA for values not listed in levels (h/t @Aaron).
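
    Writing `levels` and `labels` by hand requires keeping two vectors aligned. As a sketch, both can be derived from a single named list; the name `map` is an assumption made for illustration, and the call still relies on `factor()` accepting duplicated labels, as described above.

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")
map <- list(Yes = c("Y", "Yes"), No = c("N", "No"))

f <- factor(x,
            levels = unlist(map, use.names = FALSE),   # "Y" "Yes" "N" "No"
            labels = rep(names(map), lengths(map)))    # "Yes" "Yes" "No" "No"
f
```

    Adding another spelling then means editing only one entry of `map`.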
