How to replace outliers with NA having a particular range of values in R?

六月ゝ 毕业季﹏ submitted on 2020-01-24 21:51:06

Question


I have climate data and I'm trying to replace outliers with NA. I'm not using boxplot(x)$out because I have a specific range of valid values for each variable, and anything outside that range counts as an outlier.

temp_range <- c(-15, 45)
wind_range <- c(0, 15)
humidity_range <- c(0, 100)

My dataframe looks like this

[Image: df with outliers]

(I highlighted values that should be replaced with NA according to ranges.)

So outliers in temp1 and temp2 must be replaced with NA according to temp_range, outliers in wind according to wind_range, and outliers in humidity according to humidity_range.

Here is what I've got:

df <- read.csv2("http://pastebin.com/raw/vwqBu2M5", stringsAsFactors = FALSE)

df[,2:5] = apply(df[,2:5], 2, function(x) as.numeric(x))

#Ranges
temp_range <- c(-15, 45)
wind_range <- c(0, 15)
humidity_range <- c(0, 100)

#Function to detect outlier
in_interval <- function(x, interval){
  stopifnot(length(interval) == 2L)
  interval[1] <= x & x <= interval[2]
}


#Replace outliers according to temp_range
cols <- c('temp1', 'temp2')
df[, cols] <- lapply(df[, cols], function(x) {

  x[in_interval(x, temp_range)==FALSE] <- NA
  x
})

I repeat the last part of the code (the replacement) for every range. Is there a way to simplify it so I can avoid a lot of repetition?

One last thing: if I set cols <- c('wind'), I get a warning and the whole wind column is replaced with a constant.

Warning message:
In `[<-.data.frame`(`*tmp*`, , cols, value = list(23.88, 23.93,  :
  provided 10 variables to replace 1 variables

Any suggestions?


Answer 1:


To do it more dynamically, use a dictionary: a data frame that associates a lower and an upper outlier limit with each variable.

Here I create it in R, but it would be more practical to keep it in a CSV file so you can edit it easily.

df <- read.csv2("http://pastebin.com/raw/vwqBu2M5", stringsAsFactors = FALSE)

df[,2:5] = apply(df[,2:5], 2, function(x) as.numeric(x))


df_dict <- data.frame(variable = c("temp1", "temp2", "wind", "humidity"), 
                       out_low = c(-15, -15, 0, 0), 
                       out_high =c(45, 45, 15, 100))

for (var in df_dict$variable) {
  # Look up the limits for this variable once
  limits <- df_dict[df_dict$variable == var, ]
  df[[var]][df[[var]] < limits$out_low | df[[var]] > limits$out_high] <- NA
}
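
As a minimal sketch of the CSV idea mentioned above: the dictionary can be read with read.csv and applied row by row. The toy data frame and the inline CSV text are made up for illustration; in practice the text would live in a file of your own, and the column names variable, out_low and out_high simply follow the dictionary used in this answer.

```r
# Hypothetical toy data standing in for the question's climate data frame.
df <- data.frame(
  temp1    = c(23.88, 50, 21.63),
  wind     = c(0, 20, 0.07),
  humidity = c(56.95, 101, 76.46)
)

# Dictionary of limits. The inline `text =` argument is used only to keep
# the sketch self-contained; a real setup would pass a file path instead.
df_dict <- read.csv(text = "variable,out_low,out_high
temp1,-15,45
wind,0,15
humidity,0,100", stringsAsFactors = FALSE)

# Walk over the dictionary rows and set out-of-range values to NA.
for (i in seq_len(nrow(df_dict))) {
  var <- df_dict$variable[i]
  out_of_range <- df[[var]] < df_dict$out_low[i] | df[[var]] > df_dict$out_high[i]
  df[[var]][out_of_range] <- NA
}
```

With this layout, adding a new variable or changing a limit only means editing one row of the CSV, not the R code.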



Answer 2:


I think you're making it more complicated than it needs to be. You can use logical vectors to selectively replace only certain values in a variable:

df <- read.csv2("http://pastebin.com/raw/vwqBu2M5", stringsAsFactors = FALSE)

df[,2:5] = apply(df[,2:5], 2, function(x) as.numeric(x))

#Ranges
temp_range <- c(-15, 45)
wind_range <- c(0, 15)
humidity_range <- c(0, 100)

df$temp1[df$temp1 < temp_range[1] | df$temp1 > temp_range[2]] <- NA
df$temp2[df$temp2 < temp_range[1] | df$temp2 > temp_range[2]] <- NA
df$wind[df$wind < wind_range[1] | df$wind > wind_range[2]] <- NA
df$humidity[df$humidity < humidity_range[1] | df$humidity > humidity_range[2]] <- NA

Basically, all you're doing is taking a variable, creating a logical vector that selects only the values outside of your range, and replacing those values with NA.

That will give you the following (which doesn't quite match your image, but the numbers seem correct based on your ranges):

                  time temp2 wind humidity temp1
1  2006-11-22 22:00:00    NA 0.00    56.95 23.88
2  2006-11-22 23:00:00  15.5 0.00    58.21 23.93
3  2006-11-23 00:00:00    NA   NA    62.95 23.81
4  2006-11-23 01:00:00  12.0 0.30    70.15    NA
5  2006-11-23 02:00:00  35.0 0.07    76.46 21.63
6  2006-11-23 03:00:00  12.0 0.79       NA 21.81
7  2006-11-23 04:00:00  35.0 0.50    69.11 21.04
8  2006-11-23 05:00:00  14.0 0.37    71.86 20.32
9  2006-11-23 06:00:00  -9.0 0.26    70.97 20.50
10 2006-11-23 07:00:00    NA 0.03    78.02    NA
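
To make the logical-vector step described above visible on its own, here is a small sketch on a made-up vector (the values are invented for illustration; only wind_range comes from the question):

```r
# Made-up wind speeds; 20 and -1 fall outside the 0..15 range.
wind <- c(0.3, 20, 0.07, -1)
wind_range <- c(0, 15)

# TRUE marks a value outside the range.
is_outlier <- wind < wind_range[1] | wind > wind_range[2]

# Assigning through the logical vector touches only the TRUE positions.
wind[is_outlier] <- NA
```

Here is_outlier is FALSE, TRUE, FALSE, TRUE, so only the second and fourth values become NA.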



Answer 3:


You can define a function,

check_inRange <- function(col, range) {
   df[col] >= range[1] & df[col] <= range[2]
}

and then for every column, you can call this function as

df[!check_inRange("temp1", temp_range), "temp1"] <- NA
df[!check_inRange("temp2", temp_range), "temp2"] <- NA
df[!check_inRange("wind", wind_range), "wind"] <- NA
df[!check_inRange("humidity", humidity_range), "humidity"] <- NA

This replaces all out-of-range values in the respective columns with NA.
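
A possible variant, sketched under assumptions: the helper below works on a vector instead of reading the global df, and a named list maps each column to its range. The names na_outside and ranges and the toy data are hypothetical, not part of the original answers.

```r
# Returns x with values outside [range[1], range[2]] set to NA.
na_outside <- function(x, range) {
  stopifnot(length(range) == 2L)
  x[x < range[1] | x > range[2]] <- NA
  x
}

# Hypothetical toy data.
df <- data.frame(temp1 = c(23.88, 50), temp2 = c(-20, 15.5), wind = c(0, 20))

# Hypothetical mapping from column name to allowed range.
ranges <- list(temp1 = c(-15, 45), temp2 = c(-15, 45), wind = c(0, 15))

# df[cols] stays a data frame (a list of columns) even for a single column,
# so Map() always iterates over whole columns.
cols <- names(ranges)
df[cols] <- Map(na_outside, df[cols], ranges)
```

This form should also avoid the warning from the question: with a single column, df[, cols] drops to a plain vector, so lapply iterates over the individual values and returns one list element per row, whereas df[cols] keeps the column as a one-element list.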



Source: https://stackoverflow.com/questions/40210424/how-to-replace-outliers-with-na-having-a-particular-range-of-values-in-r
