I have used the tapply function to get the median of Age based on Pclass.
Now, how can I impute those median values into the NA values of Age, based on Pclass?
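Try the following. ave returns the per-Pclass median aligned with every row, and the NA positions of Age are then filled from that vector.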
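# Example data: NA is drawn with weight 10, each age in 20:40 with weight 1
set.seed(1)
df1 <- data.frame(Pclass = sample(1:3, 20, TRUE),
                  Age = sample(c(NA, 20:40), 20, TRUE, prob = c(10, rep(1, 21))))
# Median of Age within each Pclass, repeated for every row of that class
new <- ave(df1$Age, df1$Pclass, FUN = function(x) median(x, na.rm = TRUE))
# Overwrite only the missing ages with their class median
df1$Age[is.na(df1$Age)] <- new[is.na(df1$Age)]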
Final clean up.
rm(new)
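If the medians have already been computed with tapply, they can also be reused directly by looking up each row's class by name. This is only a minimal sketch: the data frame `toy` and the names `med` and `idx` are made up for illustration and are not from the question.

```r
# Hypothetical toy data (not the data from the question): one NA per class
toy <- data.frame(Pclass = c(1, 1, 1, 2, 2, 2),
                  Age    = c(20, NA, 30, 40, NA, 50))

# Per-class medians; the result is a vector named by Pclass ("1", "2")
med <- tapply(toy$Age, toy$Pclass, median, na.rm = TRUE)

# Rows with a missing age
idx <- is.na(toy$Age)

# Index the medians by class name and fill only the NA rows
toy$Age[idx] <- med[as.character(toy$Pclass[idx])]

toy$Age
```

The as.character() conversion matters here: indexing `med` with the numeric Pclass values would select by position rather than by class label, which only happens to work when the labels are 1, 2, 3, and so on.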
Here is another base R approach that uses replace and ave.
# Within each Pclass, replace the NA ages with that class's median
df1 <- transform(df1,
                 Age = ave(Age, Pclass, FUN = function(x) replace(x, is.na(x), median(x, na.rm = TRUE))))
df1
#   Pclass Age
# 1      A   1
# 2      A   2
# 3      A   3
# 4      B   4
# 5      B   5
# 6      B   6
# 7      C   7
# 8      C   8
# 9      C   9
The same idea, but using data.table:
library(data.table)
setDT(df1)
# Update Age by reference, one Pclass group at a time
df1[, Age := as.integer(replace(Age, is.na(Age), median(Age, na.rm = TRUE))), by = Pclass]
df1
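One thing to keep in mind about the as.integer() call above, which is presumably there to keep Age an integer column: a median can be fractional when a group has an even number of observed values, and as.integer() drops the fractional part. The sketch below uses plain vectors with made-up values (`age`, `m`, and `age_num` are illustrative names) to show the effect and one way around it.

```r
# Hypothetical group with an even number of observed ages
age <- c(21L, NA, 24L)

# The median of 21 and 24 is fractional
m <- median(age, na.rm = TRUE)   # 22.5

# Casting to integer truncates toward zero
m_int <- as.integer(m)           # 22

# Converting the column to double first keeps the exact median
age_num <- as.numeric(age)
age_num <- replace(age_num, is.na(age_num), m)
age_num
```

If fractional ages are acceptable, converting Age to numeric before imputing avoids the truncation. If whole numbers are required, rounding explicitly with round() makes the choice visible in the code.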
Data used for the replace/ave and data.table examples:
df1 <- data.frame(Pclass = rep(LETTERS[1:3], each = 3),
                  Age = 1:9)
# The logical index is recycled, so rows 2, 5 and 8 become NA
df1$Age[c(FALSE, TRUE, FALSE)] <- NA
df1
#   Pclass Age
# 1      A   1
# 2      A  NA
# 3      A   3
# 4      B   4
# 5      B  NA
# 6      B   6
# 7      C   7
# 8      C  NA
# 9      C   9