I have a large database made up of mixed data types (numeric, character, factor, ordinal factor) with missing values, and I am trying to create a for loop to substitute the
If you simply remove the obvious bugs then it works as intended:
Mode <- function (x, na.rm) {
  # na.rm is accepted so the calls below work, but it is unused: table() drops NA by default
  xtab <- table(x)
  xmode <- names(which(xtab == max(xtab)))
  if (length(xmode) > 1) xmode <- ">1 mode"
  return(xmode)
}
# fake data frame:
age <- c(5, 8, 10, 12, NA)
a <- factor(c("aa", "bb", NA, "cc", "cc"))
b <- c("banana", "apple", "pear", "grape", NA)
df_test <- data.frame(age=age, a=a, b=b)
df_test$b <- as.character(df_test$b)
print(df_test)
#   age    a      b
# 1   5   aa banana
# 2   8   bb  apple
# 3  10 <NA>   pear
# 4  12   cc  grape
# 5  NA   cc   <NA>
for (var in seq_len(ncol(df_test))) {
  column <- df_test[[var]]
  if (is.numeric(column)) {
    df_test[is.na(column), var] <- mean(column, na.rm = TRUE)
  } else if (is.character(column) || is.factor(column)) {
    # is.factor() also covers ordered factors, whose class() has length 2
    df_test[is.na(column), var] <- Mode(column, na.rm = TRUE)
  }
}
print(df_test)
#     age  a       b
# 1  5.00 aa  banana
# 2  8.00 bb   apple
# 3 10.00 cc    pear
# 4 12.00 cc   grape
# 5  8.75 cc >1 mode
I recommend that you use an editor with syntax highlighting and bracket matching, which would make it easier to find these sorts of syntax errors.
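As a possible variation on the loop above, the per-column logic can be moved into a function and applied with lapply(). This is only a sketch: the helper names mode_or_na and impute_column are made up for illustration, and leaving a value as NA when the mode is tied (instead of writing ">1 mode") is an assumption, chosen because ">1 mode" is not a valid level for a factor column.

```r
# Hypothetical helper: most frequent non-missing value, or NA if there is a tie
mode_or_na <- function(x) {
  counts <- table(x)                       # table() drops NA by default
  if (length(counts) == 0) return(NA_character_)
  top <- names(counts)[counts == max(counts)]
  if (length(top) == 1) top else NA_character_
}

# Hypothetical helper: impute a single column according to its type
impute_column <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else if (is.character(x) || is.factor(x)) {
    m <- mode_or_na(x)
    # the mode is always an existing value, so factor levels stay valid
    if (!is.na(m)) x[is.na(x)] <- m
  }
  x
}

df_test <- data.frame(
  age = c(5, 8, 10, 12, NA),
  a   = factor(c("aa", "bb", NA, "cc", "cc")),
  b   = c("banana", "apple", "pear", "grape", NA),
  stringsAsFactors = FALSE
)

# df_test[] keeps the data.frame structure while replacing every column
df_test[] <- lapply(df_test, impute_column)
print(df_test)
```

With this version, column b keeps its NA because every fruit occurs exactly once, so there is no single mode to fill in.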
First, you need to write a mode function that takes the missing values of the categorical data into account. In this example they are encoded as empty strings, i.e. strings of length 0.
The mode function:
getmode <- function(v) {
  # drop NA and empty strings, which mark the missing categorical values
  v <- v[!is.na(v) & nchar(as.character(v)) > 0]
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
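To see what the function returns, here is a small check on a factor with empty strings. The definition is repeated only so the snippet runs on its own, and the input vector is made up for illustration.

```r
# getmode() repeated so the snippet is self-contained
getmode <- function(v) {
  v <- v[nchar(as.character(v)) > 0]
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

x <- factor(c("", "BB", "CC", "BB", "BB", "CC", "AA", "BB", "", "AA"))

# Empty strings are ignored; the result is a factor of length 1
most_common <- getmode(x)
print(as.character(most_common))

# Ties: which.max() returns the first maximum, so the value seen first wins
tie_result <- getmode(c("b", "a", "a", "b"))
print(tie_result)
```

Note that, unlike a function that reports ties explicitly, this one silently picks the first of several equally frequent values.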
Then you can iterate over the columns: if a column is numeric, fill its missing values with the mean, otherwise fill them with the mode.
The loop statement below:
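# requires dplyr (loaded by library(tidyverse) in the example below)
for (cols in colnames(df)) {
  if (is.numeric(df[[cols]])) {
    # numeric column: replace NA with the column mean
    df <- df %>% mutate(!!cols := replace(!!rlang::sym(cols), is.na(!!rlang::sym(cols)), mean(!!rlang::sym(cols), na.rm = TRUE)))
  } else {
    # categorical column: replace empty strings with the mode
    df <- df %>% mutate(!!cols := replace(!!rlang::sym(cols), !!rlang::sym(cols) == "", getmode(!!rlang::sym(cols))))
  }
}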
Let's provide an example:
library(tidyverse)
df <- tibble(
  id = seq(1, 10),
  ColumnA = c(10, 9, 8, 7, NA, NA, 20, 15, 12, NA),
  ColumnB = factor(c("A", "B", "A", "A", "", "B", "A", "B", "", "A")),
  ColumnC = factor(c("", "BB", "CC", "BB", "BB", "CC", "AA", "BB", "", "AA")),
  ColumnD = c(NA, 20, 18, 22, 18, 17, 19, NA, 17, 23)
)
df
The initial df with the missing values:
# A tibble: 10 x 5
      id ColumnA ColumnB ColumnC ColumnD
   <int>   <dbl> <fct>   <fct>     <dbl>
 1     1      10 "A"     ""           NA
 2     2       9 "B"     "BB"         20
 3     3       8 "A"     "CC"         18
 4     4       7 "A"     "BB"         22
 5     5      NA ""      "BB"         18
 6     6      NA "B"     "CC"         17
 7     7      20 "A"     "AA"         19
 8     8      15 "B"     "BB"         NA
 9     9      12 ""      ""           17
10    10      NA "A"     "AA"         23
By running the for loop above, we get:
# A tibble: 10 x 5
      id ColumnA ColumnB ColumnC ColumnD
   <dbl>   <dbl> <fct>   <fct>     <dbl>
 1     1    10   A       BB         19.2
 2     2     9   B       BB         20
 3     3     8   A       CC         18
 4     4     7   A       BB         22
 5     5    11.6 A       BB         18
 6     6    11.6 B       CC         17
 7     7    20   A       AA         19
 8     8    15   B       BB         19.2
 9     9    12   A       BB         17
10    10    11.6 A       AA         23
As we can see, the missing values have been imputed.
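The example above marks missing categories with empty strings while numeric gaps are NA. If a single missing-value marker is preferred, one possible preprocessing step is to recode "" as NA first, so that is.na() can be used for every column. This is a base R sketch under that assumption; blank_to_na and df_blank are hypothetical names, and only two of the example columns are reproduced.

```r
# Hypothetical helper: turn empty strings into NA for character and factor columns
blank_to_na <- function(x) {
  if (is.character(x) || is.factor(x)) {
    x[!is.na(x) & x == ""] <- NA
    # remove the now-unused "" level so it cannot reappear in tables
    if (is.factor(x)) x <- droplevels(x)
  }
  x
}

df_blank <- data.frame(
  ColumnA = c(10, 9, 8, 7, NA, NA, 20, 15, 12, NA),
  ColumnB = factor(c("A", "B", "A", "A", "", "B", "A", "B", "", "A"))
)

df_blank[] <- lapply(df_blank, blank_to_na)

# Number of missing values per column after recoding
print(colSums(is.na(df_blank)))
```

After this step a mode function only has to ignore NA, and the same is.na() test selects the cells to fill in both numeric and categorical columns.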