Question
The language I'm using is R, but you don't necessarily need to know about R to answer the question.
Question: I have a sequence that can be considered the ground truth, and another sequence that is a shifted version of the first, with some missing values. I'd like to know how to align the two.
Setup
I have a sequence ground.truth that is basically a set of times:
ground.truth <- rep( seq(1,by=4,length.out=10), 5 ) +
rep( seq(0,length.out=5,by=4*10+30), each=10 )
Think of ground.truth as times where I'm doing the following:
{take a sample every 4 seconds for 10 times, then wait 30 seconds} x 5
I have a second sequence, observations, which is ground.truth shifted, with 20% of the values missing:
nSamples <- length(ground.truth)
idx_to_keep <- sort(sample( 1:nSamples, .8*nSamples ))
theLag <- runif(1)*100
observations <- ground.truth[idx_to_keep] + theLag
nObs <- length(observations)
If I plot these vectors this is what it looks like (remember, think of these as times):
What I've tried
I want to:
- calculate the shift (theLag in my example above)
- calculate a vector idx such that ground.truth[idx] == observations - theLag
First, assume we know theLag. Note that ground.truth[1] is not necessarily observations[1] - theLag. In fact, we have ground.truth[1] == observations[1+lagI] - theLag for some lagI.
To calculate this, I thought I'd use cross-correlation (the ccf function).
However, whenever I do this, the maximum cross-correlation occurs at lag 0, which would mean ground.truth[1] == observations[1] - theLag. But I've tried this in examples where I've explicitly made sure that observations[1] - theLag is not ground.truth[1] (i.e. I modified idx_to_keep to make sure it doesn't have 1 in it).
The shift theLag shouldn't affect the cross-correlation (isn't ccf(x,y) == ccf(x,y-constant)?), so I was going to work it out later.
Perhaps I'm misunderstanding though, because observations doesn't have as many values in it as ground.truth? Even in the simpler case where I set theLag==0, the cross-correlation function still fails to identify the correct lag, which leads me to believe I'm thinking about this wrong.
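As a quick sanity check of the invariance assumed above, here is a minimal sketch. The vectors x and y are made-up values for illustration, not the data from the question:

```r
# Hypothetical toy vectors: y is x with its first value dropped and one appended
x <- c(1, 5, 9, 13, 17, 21, 60, 64, 68, 72)
y <- c(5, 9, 13, 17, 21, 60, 64, 68, 72, 76)

# Cross-correlation with and without a constant subtracted from y
a <- ccf(x, y, plot = FALSE)$acf
b <- ccf(x, y - 55.7, plot = FALSE)$acf

# ccf removes the mean of each series, so a constant offset should not matter
print(all.equal(a, b))
```

Note that ccf treats its arguments as the values of regularly sampled series indexed by position, not as event times, which may be why its peak does not reflect the index offset here.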
Does anyone have a general methodology for me to go about this, or know of some R functions/packages that could help?
Thanks a lot.
Answer 1:
For the lag, you can compute all the differences (distances) between your two sets of points:
diffs <- outer(observations, ground.truth, '-')
Your lag should be the value that appears length(observations) times:
which(table(diffs) == length(observations))
# 55.715382960625
# 86
Double check:
theLag
# [1] 55.71538
The second part of your question is easy once you have found theLag:
idx <- which(ground.truth %in% (observations - theLag))
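Because the lag is a floating-point number, exact comparisons such as %in% can be fragile. The following is a sketch of a more tolerant variant of the same idea, not a definitive implementation. The fixed lag value, the dropped indices, the rounding precision digits, and the name estLag are all assumptions made for illustration:

```r
# Deterministic toy data (hypothetical values chosen for illustration)
ground.truth <- rep(seq(1, by = 4, length.out = 10), 5) +
  rep(seq(0, length.out = 5, by = 4 * 10 + 30), each = 10)
dropped <- c(1, 7, 12, 18, 23, 29, 34, 40, 45, 50)
idx_to_keep <- setdiff(seq_along(ground.truth), dropped)
theLag <- 55.715
observations <- ground.truth[idx_to_keep] + theLag

# Round the pairwise differences so near-identical values share one bin
digits <- 6
diffs <- round(outer(observations, ground.truth, "-"), digits)
counts <- table(diffs)

# Take the most frequent difference as the lag estimate
estLag <- as.numeric(names(counts)[which.max(counts)])

# Match each shifted observation to its nearest ground-truth time
idx <- vapply(observations - estLag,
              function(t) which.min(abs(ground.truth - t)),
              integer(1))
```

Here idx has one entry per observation, so ground.truth[idx] lines up element-wise with observations - estLag. If several differences tie for the highest count, the data alone may not determine the lag uniquely.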
Answer 2:
The following should work if your time series are not too long.
You have two vectors of time-stamps, the second one being a shifted and incomplete copy of the first, and you want to find by how much it was shifted.
# Sample data
n <- 10
x <- cumsum(rexp(n,.1))
theLag <- rnorm(1)
y <- theLag + x[sort(sample(1:n, floor(.8*n)))]
We can try all possible lags and, for each one, compute how bad the alignment is, by matching each observed timestamp with the closest "truth" timestamp.
# Loss function
library(sqldf)
f <- function(u) {
  # Put all the values in a data.frame
  d1 <- data.frame(g="truth", value=x)
  d2 <- data.frame(g="observed", value=y+u)
  d <- rbind(d1,d2)
  # For each observed value, find the next truth value
  # (we could take the nearest, on either side,
  # but it would be more complicated)
  d <- sqldf("
    SELECT A.g, A.value,
           ( SELECT MIN(B.value)
             FROM d AS B
             WHERE B.g='truth'
             AND B.value >= A.value
           ) AS next
    FROM d AS A
    WHERE A.g = 'observed'
  ")
  # If u is greater than the lag, there are missing values.
  # If u is smaller, the differences decrease
  # as we approach the lag.
  if(any(is.na(d))) {
    return(Inf)
  } else {
    return( sum(d$`next` - d$value, na.rm=TRUE) )
  }
}
We can now search for the best lag.
# Look at the loss function
sapply( seq(-2,2,by=.1), f )
# Minimize the loss function.
# Change the interval if it does not converge,
# i.e., if it seems in contradiction with the values above
# or if the minimum is Inf
(r <- optimize(f, c(-3,3)))
-r$minimum
theLag # Same value, most of the time
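The same "next truth value" loss can also be sketched in base R with findInterval, which avoids the sqldf dependency. This is an illustrative variant under stated assumptions: the name f2, its explicit x and y arguments, and the toy data are hypothetical, not part of the code above:

```r
# Hypothetical base-R variant of the loss function
f2 <- function(u, x, y) {
  xs <- sort(x)
  shifted <- y + u
  # Index of the first truth value >= each shifted observation
  pos <- findInterval(shifted, xs, left.open = TRUE) + 1
  # A shifted observation beyond the last truth value has no successor
  if (any(pos > length(xs))) return(Inf)
  sum(xs[pos] - shifted)
}

# Toy data: three of five truth timestamps, shifted by 0.5
x <- c(1, 4, 9, 16, 25)
y <- x[c(1, 3, 4)] + 0.5

print(f2(-0.5, x, y))  # shift that undoes the lag
print(f2(-1, x, y))    # shift that overshoots the lag
print(f2(20, x, y))    # shift that pushes an observation past the last truth value
```

As with f, the loss reaches zero when u equals minus the lag and jumps just past that point, so a search over a fine grid of candidate shifts may behave more predictably than optimize on some data.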
Source: https://stackoverflow.com/questions/10220580/aligning-sequences-with-missing-values