Median imputation using sapply

喜夏-厌秋 submitted on 2019-12-01 11:30:35

This is actually a subtle problem, so worth a bit of discussion (IMO). You have a data frame and want to impute medians for numeric columns only, with the result being, of course, a data frame.

The apply(...) function will coerce its argument to a matrix first. Since all elements in a matrix must by definition be the same data type, if there are any character or factor columns in the original df, the whole matrix will be coerced to character when it is passed to apply(...). As a result, is.numeric(x) is FALSE for every column and nothing gets imputed.

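# 1st column of df is a factor
# (stringsAsFactors=TRUE is needed in R >= 4.0.0, where the default became FALSE)
df <- data.frame(a=letters[1:5],x=sample(1:5,5),y=runif(5),stringsAsFactors=TRUE)
df[3,"x"] <- NA
df[5,"y"] <- NA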
df
#   a  x         y
# 1 a  5 0.5235779
# 2 b  3 0.2142011
# 3 c NA 0.8886608
# 4 d  4 0.4952574
# 5 e  1        NA

apply(df,2,function(x) {
  if(is.numeric(x)) ifelse(is.na(x),median(x,na.rm=TRUE),x) else x})
#      a   x    y          
# [1,] "a" " 5" "0.5235779"
# [2,] "b" " 3" "0.2142011"
# [3,] "c" NA   "0.8886608"
# [4,] "d" " 4" "0.4952574"
# [5,] "e" " 1" NA         
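
To see why the is.numeric(x) branch never fires inside apply(...), it can help to look at the matrix that apply(...) actually works on. This is only a minimal sketch on a small made-up data frame (d, m and col_is_numeric are hypothetical names, not part of the code above):

```r
# Hypothetical example data: one factor column, one numeric column
d <- data.frame(a = factor(c("u", "v", "w")), x = c(1, NA, 3))

# apply() calls as.matrix() on a data frame; a mix of factor and numeric
# columns can only be stored as a character matrix
m <- as.matrix(d)
typeof(m)          # "character"

# Inside apply(), every column is therefore a character vector,
# so is.numeric() is FALSE for all of them
col_is_numeric <- apply(d, 2, is.numeric)
col_is_numeric     # a = FALSE, x = FALSE
```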

sapply(df,FUN=f) will pass the columns of df individually to a function f(...), but the result will be a matrix. So, for example, any factors in df will be coerced to integer.

sapply(df,function(x) {
  if(is.numeric(x)) ifelse(is.na(x),median(x,na.rm=TRUE),x) else x})
#      a   x         y
# [1,] 1 5.0 0.5235779
# [2,] 2 3.0 0.2142011
# [3,] 3 3.5 0.8886608
# [4,] 4 4.0 0.4952574
# [5,] 5 1.0 0.5094176

So here, df$x and df$y are correct, but look what happened to df$a: the factor was coerced to numeric by returning its underlying integer codes rather than its levels - not what you want!
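
The integer values come from how factors are stored internally. As a rough illustration (f, codes, d and res are hypothetical names used only for this sketch), dropping the factor class leaves just the codes, and that is what ends up in the matrix sapply(...) builds:

```r
# Hypothetical factor with explicitly ordered levels
f <- factor(c("low", "high", "low"), levels = c("low", "high"))

# Converting to integer keeps only the position of each value in levels(f)
codes <- as.integer(f)
codes                      # 1 2 1

# sapply() simplifies its result to a matrix, so a factor column
# ends up as its integer codes
d <- data.frame(f = f, x = c(1, NA, 3))
res <- sapply(d, function(col) col)
is.matrix(res)             # TRUE
res[, "f"]                 # 1 2 1
```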

lapply(df,FUN=f) will return a list, which can then be converted to a data frame. This approach gives you the desired result:

data.frame(lapply(df,function(x) {
    if(is.numeric(x)) ifelse(is.na(x),median(x,na.rm=TRUE),x) else x}))
#   a   x         y
# 1 a 5.0 0.5235779
# 2 b 3.0 0.2142011
# 3 c 3.5 0.8886608
# 4 d 4.0 0.4952574
# 5 e 1.0 0.5094176
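
If this is needed more than once, the same idea can be wrapped in a small function. The sketch below is one possible variant, not the only way to do it: impute_median_df is a hypothetical name, and it uses the d[] <- lapply(...) idiom so that the result stays a data frame without a separate data.frame(...) call.

```r
# impute_median_df is a hypothetical helper name, not from the answer above
impute_median_df <- function(d) {
  d[] <- lapply(d, function(col) {
    if (is.numeric(col)) {
      # fill missing values with the median of the observed values
      col[is.na(col)] <- median(col, na.rm = TRUE)
    }
    col  # non-numeric columns are returned unchanged
  })
  d
}

# Made-up example data
d <- data.frame(a = factor(c("p", "q", "r", "s")),
                x = c(1, NA, 3, 10),
                y = c(NA, 0.5, 1.5, 2.5))
out <- impute_median_df(d)
out$x            # 1 3 3 10
out$y            # 1.5 0.5 1.5 2.5
is.factor(out$a) # TRUE
```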

I suppose it's debatable whether this is any better than using a loop...
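
For comparison, a plain loop over the column names might look something like this (a sketch on made-up data; d, nm and miss are hypothetical names):

```r
# Made-up example data with one factor and one numeric column
d <- data.frame(a = c("p", "q", "r"), x = c(4, NA, 8), stringsAsFactors = TRUE)

for (nm in names(d)) {
  if (is.numeric(d[[nm]])) {
    miss <- is.na(d[[nm]])
    # replace missing entries in place; other columns are never touched
    d[[nm]][miss] <- median(d[[nm]], na.rm = TRUE)
  }
}
d$x   # 4 6 8
```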

If every column is numeric, you could use apply to run a function over all columns (note that the result is a matrix, not a data frame):

dat<-data.frame(c1=c(1,2,3,NA),c2=c(10, NA, 20, 30))
apply(dat, 2, function(x) ifelse(is.na(x), median(x, na.rm=TRUE), x))

This version is slightly faster:

imputeMedianv3 <- function(x) apply(x, 2, function(col) {
  col[is.na(col)] <- median(col, na.rm=TRUE)
  col
})
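
Because apply(...) returns a matrix, the result has to be converted back if a data frame is needed. A minimal sketch, assuming all columns are numeric as in dat above (imputed and imputed_df are hypothetical names):

```r
dat <- data.frame(c1 = c(1, 2, 3, NA), c2 = c(10, NA, 20, 30))

# Same in-place replacement as above, applied column by column
imputed <- apply(dat, 2, function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
})
is.matrix(imputed)   # TRUE: apply() returns a matrix, not a data frame

# Convert back if a data frame is needed
imputed_df <- as.data.frame(imputed)
imputed_df$c1        # 1 2 3 2
imputed_df$c2        # 10 20 20 30
```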

I'm sure that if performance is what you're looking for, someone will provide a data.table solution (unfortunately I am not familiar with that package, so I can't write one myself).
