I want to replace missing values in the columns of a data frame. I have written the following code:
MedianImpute <- function(data)
{
  for(i in 1:ncol(data))
  {
    # only numeric (integer or double) columns are imputed
    if(is.numeric(data[,i]))
    {
      if(any(is.na(data[,i])))
      {
        data[is.na(data[,i]),i] <- median(data[,i],na.rm = TRUE)
      }
    }
  }
  return(data)
}
This returns the data frame with the NAs replaced by the column median. I do not want to use a for loop; how can I get the same result using one of the apply functions in R?
This is actually a subtle problem, so worth a bit of discussion (IMO). You have a data frame and want to impute medians for numeric columns only, with the result being, of course, a data frame.
The apply(...) function will coerce its argument to a matrix first. Since all elements in a matrix must by definition be the same data type, if there are any character or factor columns in the original df, the whole matrix will be coerced to character when it is passed to apply(...).
# 1st column of df is a factor (stringsAsFactors=TRUE is needed for this in R >= 4.0)
df <- data.frame(a=letters[1:5],x=sample(1:5,5),y=runif(5),stringsAsFactors=TRUE)
df[3,]$x <- NA
df[5,]$y <- NA
df
# a x y
# 1 a 5 0.5235779
# 2 b 3 0.2142011
# 3 c NA 0.8886608
# 4 d 4 0.4952574
# 5 e 1 NA
apply(df,2,function(x) {
  if(is.numeric(x)) ifelse(is.na(x),median(x,na.rm=TRUE),x) else x})
# a x y
# [1,] "a" " 5" "0.5235779"
# [2,] "b" " 3" "0.2142011"
# [3,] "c" NA "0.8886608"
# [4,] "d" " 4" "0.4952574"
# [5,] "e" " 1" NA
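The NAs survive in the output above because, by the time the anonymous function sees a column, it is already a character vector and is.numeric(x) is FALSE. A minimal sketch of that coercion, using a small made-up data frame (df_demo is a hypothetical name, not part of the original example):

```r
# Hypothetical two-row data frame with one character and two numeric columns
df_demo <- data.frame(a = c("u", "v"), x = c(1L, NA), y = c(0.5, 1.5))

# apply() converts a data frame with as.matrix() before doing anything else
m <- as.matrix(df_demo)

is.character(m)        # TRUE: one non-numeric column makes the whole matrix character
is.numeric(m[, "x"])   # FALSE: so the imputation branch is never taken
```
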
sapply(df,FUN=f) will pass the columns of df individually to a function f(...), but the result will be a matrix. So, for example, any factors in df will be coerced to integer.
sapply(df,function(x) {
  if(is.numeric(x)) ifelse(is.na(x),median(x,na.rm=TRUE),x) else x})
# a x y
# [1,] 1 5.0 0.5235779
# [2,] 2 3.0 0.2142011
# [3,] 3 3.5 0.8886608
# [4,] 4 4.0 0.4952574
# [5,] 5 1.0 0.5094176
So here, df$x and df$y are correct, but look what happened to df$a: the factor was coerced to numeric, leaving only its underlying integer codes. Not what you want!
lapply(df,FUN=f) will return a list, which can then be converted to a data frame. This approach gives you the desired result (the output below comes from a different random draw of df, so the values differ from those above):
data.frame(lapply(df,function(x) {
  if(is.numeric(x)) ifelse(is.na(x),median(x,na.rm=TRUE),x) else x}))
# a x y
# 1 a 1.0 0.3093707
# 2 b 3.0 0.3486391
# 3 c 3.5 0.8292446
# 4 d 5.0 0.7882574
# 5 e 4.0 0.5684483
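A common variation on the lapply() approach is to assign the list back into the existing data frame with df[] <- ..., which keeps the data frame structure without a separate data.frame() call. This is only a sketch: impute_if_numeric and df2 are hypothetical names, and the values are made up for illustration.

```r
# Hypothetical helper: impute the median for numeric columns, leave others alone
impute_if_numeric <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
  }
  x
}

# Made-up data with a factor, an integer and a double column
df2 <- data.frame(a = factor(c("a", "b", "c", "d", "e")),
                  x = c(5L, 3L, NA, 4L, 1L),
                  y = c(0.52, 0.21, 0.89, 0.50, NA))

# df2[] keeps df2 a data frame while replacing every column with the lapply() result
df2[] <- lapply(df2, impute_if_numeric)
str(df2)
```

Note that an integer column becomes double when its median is not a whole number, as happens with x here.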
I suppose it's debatable whether this is any better than using a loop...
You could use apply to apply a function across all columns (this works when every column is numeric, as here):
dat<-data.frame(c1=c(1,2,3,NA),c2=c(10, NA, 20, 30))
apply(dat, 2, function(x) ifelse(is.na(x), median(x, na.rm=TRUE), x))
A slightly faster version:
imputeMedianv3 <- function(x) apply(x, 2, function(x){x[is.na(x)] <- median(x, na.rm=TRUE); x})
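One thing to keep in mind is that apply() returns a matrix rather than a data frame. If a data frame is needed, the result can be converted back, roughly as in this sketch (impute_median, res_matrix and res_df are hypothetical names; the data are the same as dat above):

```r
# Same all-numeric example data as above
dat <- data.frame(c1 = c(1, 2, 3, NA), c2 = c(10, NA, 20, 30))

# Replace the NAs of one numeric vector with its median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

res_matrix <- apply(dat, 2, impute_median)  # numeric matrix, not a data frame
res_df <- as.data.frame(res_matrix)         # convert back if a data frame is needed

is.matrix(res_matrix)
res_df
```
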
I'm sure that if performance is what you're after, someone will provide a data.table solution (unfortunately I am not familiar with that package, so I can't do it myself).
Source: https://stackoverflow.com/questions/23242389/median-imputation-using-sapply