Replace missing values with the mean or mode in R

终归单人心 2020-12-18 13:09

I have a large database made up of mixed data types (numeric, character, factor, ordinal factor) with missing values, and I am trying to create a for loop to substitute the

2 answers
  • 2020-12-18 13:45

    If you simply remove the obvious bugs then it works as intended:

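    # Most frequent value of x; returns the marker ">1 mode" when several values tie.
    Mode <- function(x, na.rm = TRUE) {
        # count each value; NA is counted only when na.rm = FALSE
        xtab <- table(x, useNA = if (na.rm) "no" else "ifany")
        xmode <- names(which(xtab == max(xtab)))
        if (length(xmode) > 1) xmode <- ">1 mode"
        return(xmode)
    }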
    
    # fake array:
    age <- c(5, 8, 10, 12, NA)
    a <- factor(c("aa", "bb", NA, "cc", "cc"))
    b <- c("banana", "apple", "pear", "grape", NA)
    df_test <- data.frame(age=age, a=a, b=b)
    df_test$b <- as.character(df_test$b)
    
    print(df_test)
    
    #   age    a      b
    # 1   5   aa banana
    # 2   8   bb  apple
    # 3  10 <NA>   pear
    # 4  12   cc  grape
    # 5  NA   cc   <NA>
    
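    for (var in seq_along(df_test)) {
        # is.numeric()/is.factor() also cover integer and ordered-factor columns,
        # where class() returns "integer" or more than one class name
        if (is.numeric(df_test[, var])) {
            df_test[is.na(df_test[, var]), var] <- mean(df_test[, var], na.rm = TRUE)
        } else if (is.character(df_test[, var]) || is.factor(df_test[, var])) {
            df_test[is.na(df_test[, var]), var] <- Mode(df_test[, var], na.rm = TRUE)
        }
    }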
    
    print(df_test)
    
    #     age  a       b
    # 1  5.00 aa  banana
    # 2  8.00 bb   apple
    # 3 10.00 cc    pear
    # 4 12.00 cc   grape
    # 5  8.75 cc >1 mode
    

    I recommend that you use an editor with syntax highlighting and bracket matching, which would make it easier to find these sorts of syntax errors.
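The same idea can also be written without an explicit loop over column indices. The following is only a sketch in base R, not the answerer's code: `first_mode` and `impute_column` are hypothetical helper names, and the tie-breaking rule (take the first value seen instead of writing ">1 mode" into the data) is an assumption made for illustration.

```r
# Hypothetical helper: most frequent non-missing value.
# On ties, the value that appears first in x wins.
first_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper: impute a single column according to its type.
impute_column <- function(col) {
  # nothing to do if there are no NAs, and nothing to learn from if all are NA
  if (!anyNA(col) || all(is.na(col))) return(col)
  if (is.numeric(col)) {
    col[is.na(col)] <- mean(col, na.rm = TRUE)
  } else if (is.factor(col) || is.character(col)) {
    # works for plain and ordered factors, since the mode is an existing level
    col[is.na(col)] <- as.character(first_mode(col))
  }
  col
}

df_test <- data.frame(
  age = c(5, 8, 10, 12, NA),
  a   = factor(c("aa", "bb", NA, "cc", "cc")),
  b   = c("banana", "apple", "pear", "grape", NA),
  stringsAsFactors = FALSE
)

# df[] <- lapply(...) keeps the data.frame structure while replacing every column
df_imputed <- df_test
df_imputed[] <- lapply(df_test, impute_column)
print(df_imputed)
```

With this tie rule, the missing value in `b` becomes "banana" (the first of five equally frequent values), which may or may not be what you want for a column with no clear mode.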

  • 2020-12-18 13:50

    First, write a mode function that accounts for how missing values appear in the categorical data: here they are empty strings, that is, values with no characters.
    The mode function:

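    getmode <- function(v) {
      # drop NA and empty strings, which mark missing categorical values here
      v <- v[!is.na(v) & nzchar(as.character(v))]
      uniqv <- unique(v)
      # count occurrences of each distinct value and return the most frequent one
      uniqv[which.max(tabulate(match(v, uniqv)))]
    }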
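
To make the `match`/`tabulate`/`which.max` idiom easier to follow, here is a step-by-step sketch of the same computation. The vector is the ColumnC data from this answer's example; the intermediate variable names (`idx`, `counts`, `mode_value`) are only illustrative.

```r
v <- factor(c("", "BB", "CC", "BB", "BB", "CC", "AA", "BB", "", "AA"))

# Step 1: drop the empty strings that stand for missing values.
v <- v[nzchar(as.character(v))]

# Step 2: distinct values, in order of first appearance.
uniqv <- unique(v)

# Step 3: for each element of v, its position within uniqv.
idx <- match(v, uniqv)

# Step 4: how often each position (that is, each distinct value) occurs.
counts <- tabulate(idx)

# Step 5: the value with the highest count; the first one wins on ties.
mode_value <- uniqv[which.max(counts)]

print(as.character(uniqv))       # "BB" "CC" "AA"
print(counts)                    # 4 2 2
print(as.character(mode_value))  # "BB"
```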
    

    Then you can iterate over the columns: if a column is numeric, fill its missing values with the mean; otherwise, fill them with the mode.

    The loop statement below:

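    # Requires dplyr to be attached (e.g. via library(tidyverse)) and rlang to be installed
    for (cols in colnames(df)) {
      col_sym <- rlang::sym(cols)
      if (is.numeric(df[[cols]])) {
        # numeric column: replace NA with the column mean
        df <- df %>% mutate(!!cols := replace(!!col_sym, is.na(!!col_sym), mean(!!col_sym, na.rm = TRUE)))
      } else {
        # categorical column: replace empty strings with the mode
        df <- df %>% mutate(!!cols := replace(!!col_sym, (!!col_sym) == "", getmode(!!col_sym)))
      }
    }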
    

    Let's provide an example:

    library(tidyverse)
    
    df<-tibble(id=seq(1,10), ColumnA=c(10,9,8,7,NA,NA,20,15,12,NA), 
               ColumnB=factor(c("A","B","A","A","","B","A","B","","A")),
               ColumnC=factor(c("","BB","CC","BB","BB","CC","AA","BB","","AA")),
               ColumnD=c(NA,20,18,22,18,17,19,NA,17,23)
               )
    
    df
    

    The initial df with the missing values:

    # A tibble: 10 x 5
          id ColumnA ColumnB ColumnC ColumnD
       <int>   <dbl> <fct>   <fct>     <dbl>
     1     1      10 "A"     ""           NA
     2     2       9 "B"     "BB"         20
     3     3       8 "A"     "CC"         18
     4     4       7 "A"     "BB"         22
     5     5      NA ""      "BB"         18
     6     6      NA "B"     "CC"         17
     7     7      20 "A"     "AA"         19
     8     8      15 "B"     "BB"         NA
     9     9      12 ""      ""           17
    10    10      NA "A"     "AA"         23
    

    By running the for loop above, we get:

    # A tibble: 10 x 5
          id ColumnA ColumnB ColumnC ColumnD
       <dbl>   <dbl> <fct>   <fct>     <dbl>
     1     1    10   A       BB         19.2
     2     2     9   B       BB         20  
     3     3     8   A       CC         18  
     4     4     7   A       BB         22  
     5     5    11.6 A       BB         18  
     6     6    11.6 B       CC         17  
     7     7    20   A       AA         19  
     8     8    15   B       BB         19.2
     9     9    12   A       BB         17  
    10    10    11.6 A       AA         23 
    

    As we can see, the missing values have been imputed.
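
As a possible alternative to the loop, the same imputation can be sketched with `across()`, assuming dplyr 1.0.0 or later is installed. This is not part of the original answer; it reuses the example data and the `getmode` idea from above, and the name `df_imputed` is only illustrative.

```r
library(dplyr)

# Mode of the non-missing, non-empty values (first value wins on ties).
getmode <- function(v) {
  v <- v[!is.na(v) & nzchar(as.character(v))]
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

df <- tibble(
  id = seq(1, 10),
  ColumnA = c(10, 9, 8, 7, NA, NA, 20, 15, 12, NA),
  ColumnB = factor(c("A", "B", "A", "A", "", "B", "A", "B", "", "A")),
  ColumnC = factor(c("", "BB", "CC", "BB", "BB", "CC", "AA", "BB", "", "AA")),
  ColumnD = c(NA, 20, 18, 22, 18, 17, 19, NA, 17, 23)
)

df_imputed <- df %>%
  mutate(
    # numeric columns: NA becomes the column mean
    across(where(is.numeric), ~ replace(.x, is.na(.x), mean(.x, na.rm = TRUE))),
    # factor columns: the empty string becomes the mode
    across(where(is.factor), ~ replace(.x, .x == "", getmode(.x)))
  )

print(df_imputed)
```

Note that the now unused empty-string level stays in the factor columns; `droplevels()` can remove it if that matters for later steps.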
