I am trying to detect anomalous values in a time series of climatic data with some missing observations. Searching the web I found many available approaches. Of those, stl d
In the beginning of stl
we find
x <- na.action(as.ts(x))
and soon after that
period <- frequency(x)
if (period < 2 || n <= 2 * period)
stop("series is not periodic or has less than two periods")
That is, stl
expects x
to be ts
object after na.action(as.ts(x))
(otherwise period == 1
). Let us check na.omit
and na.exclude
first.
Clearly, at the end of getAnywhere("na.omit.ts")
we find
if (any(is.na(object)))
stop("time series contains internal NAs")
which is straightforward and nothing can be done (na.omit
does not exclude NAs
from ts
objects). Now getAnywhere("na.exclude.default")
excludes NA
observations, but returns an object of class exclude
:
attr(omit, "class") <- "exclude"
and this is a different situation. As mentioned above, stl
expects na.action(as.ts(x))
to be ts
, but na.exclude(as.ts(x))
is of class exclude
.
Hence if one is satisfied with NAs
exclusion then e.g.
nottem[3] <- NA
frequency(nottem)
# [1] 12
na.new <- function(x) ts(na.exclude(x), frequency = 12)
stl(nottem, na.action = na.new, s.window = "per")
works. In general, stl
does not work with NA
values (i.e. with na.action = na.pass
), it crashes deeper in Fortran (see full source code here):
z <- .Fortran(C_stl, ...
Alternatives to na.new
are not delightful:
na.contaguous
- finds the longest consecutive stretch of non-missing values in a time series object.na.approx
, na.locf
from zoo
or some other interpolation function.As we can see in the paper, there is no some simple procedure for missing values (like approximating them in the very beginning) which could be applied to the time series before calling stl
. So considering the fact that original implementation is quite lengthy I would think about some other alternatives than whole new implementation.
Update: a quite optimal in many aspects choice when having NAs
could be na.approx
from zoo
, so let us check its performance, i.e. compare results of stl
with full data set and results when having some number of NAs
, using na.approx
. I am using MAPE as a measure of accuracy, but only for trend, because seasonal component and remainder crosses zero and it would distort the result. Positions for NAs
are chosen at random.
library(zoo)
library(plyr)
library(reshape)
library(ggplot2)
mape <- function(f, x) colMeans(abs(1 - f / x) * 100)
stlCheck <- function(data, p = 3, ...){
set.seed(20130201)
pos <- lapply(3^(0:p), function(x) sample(1:length(data), x))
datasetsNA <- lapply(pos, function(x) {data[x] <- NA; data})
original <- data.frame(stl(data, ...)$time.series, stringsAsFactors = FALSE)
original$id <- "Original"
datasetsNA <- lapply(datasetsNA, function(x)
data.frame(stl(x, na.action = na.approx, ...)$time.series,
id = paste(sum(is.na(x)), "NAs"),
stringsAsFactors = FALSE))
stlAll <- rbind.fill(c(list(original), datasetsNA))
stlAll$Date <- time(data)
stlAll <- melt(stlAll, id.var = c("id", "Date"))
results <- data.frame(trend = sapply(lapply(datasetsNA, '[', i = "trend"), mape, original[, "trend"]))
results$id <- paste(3^(0:p), "NAs")
results <- melt(results, id.var = "id")
results$x <- min(stlAll$Date) + diff(range(stlAll$Date)) / 4
results$y <- min(original[, "trend"]) + diff(range(original[, "trend"])) / (4 * p) * (0:p)
results$value <- round(results$value, 2)
ggplot(stlAll, aes(x = Date, y = value, colour = id, group = id)) + geom_line() +
facet_wrap(~ variable, scales = "free_y") + theme_bw() +
theme(legend.title = element_blank(), strip.background = element_rect(fill = "white")) +
labs(x = NULL, y = NULL) + scale_colour_brewer(palette = "Set1") +
lapply(unique(results$id), function(z)
geom_text(data = results, colour = "black", size = 3,
aes(x = x, y = y, label = paste0("MAPE (", id, "): ", value, "%"))))
}
nottem
, 240 observations
stlCheck(nottem, s.window = 4, t.window = 50, t.jump = 1)
co2
, 468 observations
stlCheck(log(co2), s.window = 21)
mdeaths
, 72 observations
stlCheck(mdeaths, s.window = "per")
Visually we do see some differences in trend in cases 1 and 3. But these differences are pretty small in 1 and also satisfactory in 3 considering sample size (72).
Realize this is an old question, but thought I'd update since there is a newer stl
package available in R called stlplus
. Here is its homepage on github. You can install it from CRAN with install.packages("stlplus")
or directly from github with devtools::install_github("hafen/stlplus")
.