Question
I have climate data and I'm trying to replace outliers with NA. I'm not using boxplot(x)$out because I have a fixed range of valid values for each variable, and anything outside that range counts as an outlier.
temp_range <- c(-15, 45)
wind_range <- c(0, 15)
humidity_range <- c(0, 100)
My data frame looks like this:
[Image: df with outliers]
(I highlighted the values that should be replaced with NA according to the ranges.)
So outliers in temp1 and temp2 must be replaced with NA according to temp_range, outliers in wind according to wind_range, and outliers in humidity according to humidity_range.
Here is what I've got:
df <- read.csv2("http://pastebin.com/raw/vwqBu2M5", stringsAsFactors = FALSE)
df[,2:5] = apply(df[,2:5], 2, function(x) as.numeric(x))
#Ranges
temp_range <- c(-15, 45)
wind_range <- c(0, 15)
humidity_range <- c(0, 100)
# Function to detect whether values lie inside an interval
in_interval <- function(x, interval) {
  stopifnot(length(interval) == 2L)
  interval[1] <= x & x <= interval[2]
}
# Replace outliers according to temp_range
cols <- c('temp1', 'temp2')
df[, cols] <- lapply(df[, cols], function(x) {
  x[!in_interval(x, temp_range)] <- NA
  x
})
I repeat the last part of the code (the replacement) for every range. Is there a way to simplify it so I can avoid all the repetition?
One last thing: if I set cols <- c('wind'), the same code throws a warning and replaces the whole wind column with a constant.
Warning message:
In `[<-.data.frame`(`*tmp*`, , cols, value = list(23.88, 23.93, :
provided 10 variables to replace 1 variables
Any suggestions?
Answer 1:
To do it more dynamically, use a dictionary: a data frame that associates the lower and upper outlier limits with each variable.
Here I create it in R, but it would be more practical to keep it in a CSV file so you can edit it easily.
df <- read.csv2("http://pastebin.com/raw/vwqBu2M5", stringsAsFactors = FALSE)
df[,2:5] = apply(df[,2:5], 2, function(x) as.numeric(x))
df_dict <- data.frame(variable = c("temp1", "temp2", "wind", "humidity"),
                      out_low = c(-15, -15, 0, 0),
                      out_high = c(45, 45, 15, 100))
for (var in df_dict$variable) {
  df[[var]][df[[var]] < df_dict[df_dict$variable == var, ]$out_low | df[[var]] > df_dict[df_dict$variable == var, ]$out_high] <- NA
}
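The "keep it in a CSV" idea above can be sketched roughly as follows. This is only an illustration under assumptions: the sample values in df are made up (they are not the pastebin data), and the lookup table is read from an inline string with read.csv(text = ...) to stand in for a real file you would maintain separately.

```r
# Made-up sample data standing in for the real file (illustrative values only)
df <- data.frame(
  temp1    = c(23.88, 55.00, 21.63),
  temp2    = c(-20.0, 15.5, 35.0),
  wind     = c(0.00, 20.0, 0.07),
  humidity = c(56.95, 101.0, 76.46)
)

# Lookup table as it might appear in a CSV file; read from inline text here
dict_csv <- "variable,out_low,out_high
temp1,-15,45
temp2,-15,45
wind,0,15
humidity,0,100"
df_dict <- read.csv(text = dict_csv, stringsAsFactors = FALSE)

# Walk the lookup table row by row and blank out values outside the limits
for (i in seq_len(nrow(df_dict))) {
  var  <- df_dict$variable[i]
  low  <- df_dict$out_low[i]
  high <- df_dict$out_high[i]
  x <- df[[var]]
  # which() ignores NA comparisons, so values that are already NA stay untouched
  x[which(x < low | x > high)] <- NA
  df[[var]] <- x
}
```

Pulling the limits out once per row avoids filtering df_dict twice inside the subscript, and adding a new variable only means adding a line to the table.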
Answer 2:
I think you're making it more complicated than it needs to be. You can use logical vectors to selectively replace only certain values in a variable:
df <- read.csv2("http://pastebin.com/raw/vwqBu2M5", stringsAsFactors = FALSE)
df[,2:5] = apply(df[,2:5], 2, function(x) as.numeric(x))
#Ranges
temp_range <- c(-15, 45)
wind_range <- c(0, 15)
humidity_range <- c(0, 100)
df$temp1[df$temp1 < temp_range[1] | df$temp1 > temp_range[2]] <- NA
df$temp2[df$temp2 < temp_range[1] | df$temp2 > temp_range[2]] <- NA
df$wind[df$wind < wind_range[1] | df$wind > wind_range[2]] <- NA
df$humidity[df$humidity < humidity_range[1] | df$humidity > humidity_range[2]] <- NA
Basically, all you're doing is taking a variable, creating a logical vector that selects only the values outside your range, and replacing those values with NA.
That will give you the following (which doesn't quite match your image, but the numbers seem correct based on your ranges):
                  time temp2 wind humidity temp1
1  2006-11-22 22:00:00    NA 0.00    56.95 23.88
2  2006-11-22 23:00:00  15.5 0.00    58.21 23.93
3  2006-11-23 00:00:00    NA   NA    62.95 23.81
4  2006-11-23 01:00:00  12.0 0.30    70.15    NA
5  2006-11-23 02:00:00  35.0 0.07    76.46 21.63
6  2006-11-23 03:00:00  12.0 0.79       NA 21.81
7  2006-11-23 04:00:00  35.0 0.50    69.11 21.04
8  2006-11-23 05:00:00  14.0 0.37    71.86 20.32
9  2006-11-23 06:00:00  -9.0 0.26    70.97 20.50
10 2006-11-23 07:00:00    NA 0.03    78.02    NA
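If you want the same logical-vector replacement without writing one line per column, it could be wrapped in a small helper and applied with lapply. This is a sketch, not part of the original answer: replace_out_of_range and its limits argument are names made up here, and the sample values are invented. It also touches on the warning from the question: df[, "wind"] with a single column name drops to a plain vector, so lapply iterates over the 10 individual values and returns a list of 10. Indexing with df[cols] (or df[, cols, drop = FALSE]) keeps a data frame even for one column.

```r
# Made-up sample values for illustration (not the pastebin data)
df <- data.frame(
  temp1 = c(23.88, 60.00, NA),
  temp2 = c(-20.0, 15.5, 35.0),
  wind  = c(0.00, 16.0, 0.30)
)
temp_range <- c(-15, 45)
wind_range <- c(0, 15)

# Hypothetical helper: set values outside limits[1]..limits[2] to NA
replace_out_of_range <- function(x, limits) {
  stopifnot(length(limits) == 2L)
  # Existing NAs are excluded from the mask, so they simply stay NA
  out <- !is.na(x) & (x < limits[1] | x > limits[2])
  x[out] <- NA
  x
}

# df[cols] always returns a data frame, so lapply sees whole columns
temp_cols <- c("temp1", "temp2")
df[temp_cols] <- lapply(df[temp_cols], replace_out_of_range, limits = temp_range)

# The single-column case works the same way, with no warning
wind_cols <- "wind"
df[wind_cols] <- lapply(df[wind_cols], replace_out_of_range, limits = wind_range)
```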
Answer 3:
You can define a function,
check_inRange <- function(col, range) {
  df[col] >= range[1] & df[col] <= range[2]
}
and then for every column, you can call this function as
df[!check_inRange("temp1", temp_range), "temp1"] <- NA
df[!check_inRange("temp2", temp_range), "temp2"] <- NA
df[!check_inRange("wind", wind_range), "wind"] <- NA
df[!check_inRange("humidity", humidity_range), "humidity"] <- NA
This replaces all out-of-range values in the respective columns with NA.
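Note that check_inRange above reads df from the enclosing environment, and as far as I know df[i, j] <- value stops with an error when the row index contains NA, which happens if a column already has missing values. A variant that takes the data frame as an argument and loops over a named list of ranges might look like the sketch below. The names in_range, mask_out_of_range and ranges are made up for this example, and the sample values are invented.

```r
# Made-up sample values for illustration (not the pastebin data)
climate <- data.frame(
  temp1    = c(23.88, 50.00, NA),
  wind     = c(0.30, -1.00, 15.00),
  humidity = c(56.95, 120.0, 100.0)
)

# One entry per column, named after the column it applies to
ranges <- list(
  temp1    = c(-15, 45),
  wind     = c(0, 15),
  humidity = c(0, 100)
)

# TRUE only for non-missing values inside the closed interval
in_range <- function(x, limits) {
  !is.na(x) & x >= limits[1] & x <= limits[2]
}

# Return a copy of data with out-of-range values set to NA
mask_out_of_range <- function(data, ranges) {
  for (col in names(ranges)) {
    keep <- in_range(data[[col]], ranges[[col]])
    data[[col]][!keep] <- NA
  }
  data
}

cleaned <- mask_out_of_range(climate, ranges)
```

Because the function returns a modified copy, the original data frame stays available for comparison.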
Source: https://stackoverflow.com/questions/40210424/how-to-replace-outliers-with-na-having-a-particular-range-of-values-in-r