I have two matrices of normalized read counts for control and treatment in a time series day1 to day26. I want to calculate distance matrix by Dynamic Time Wrapping afterwar
If your question is "why am I getting this error?" the answer is that you're trying to subset a matrix, which is a two dimensional array, according to a 3rd dimension.
see:
dim(lead)
# [1] 26 2603
lead[,,6.418423] # yes, that's the value j has the first time through the loop
# This will reproduce your error
lead[,,1]
# This will also reproduce your error
Hopefully you can see now that you have a few problems:
i
and j
values are the values in control
and lead
respectively. You can use them as their values, or you can generate the index, e.g., for(i in seq_along(control)
if you're planning to use it for something other than getting that same value out.dist
function. dist
takes a single matrix and computes the distance between its rows. You seem to be trying to pass it two values from two different matrices, or perhaps two subsets of two different matrices. It looks like you might need to go back and look at the examples in the documentation for xtr
There have already been questions similar to yours, but the answers haven't been too detailed. Here's a breakdown of what you need to know, in the specific case of R.
The proxy package is made specifically for the calculation of cross-distance matrices. You should check its vignette to know which measures are already implemented by it. An example of its use:
set.seed(1L)
sample_data <- matrix(rnorm(50L), nrow = 5L, ncol = 10L)
suppressPackageStartupMessages(library(proxy))
distance_matrix <- proxy::dist(sample_data, method = "euclidean",
upper = TRUE, diag = TRUE)
print(distance_matrix)
#> 1 2 3 4 5
#> 1 0.000000 2.636027 3.834764 5.943374 3.704322
#> 2 2.636027 0.000000 2.587398 4.515470 2.310364
#> 3 3.834764 2.587398 0.000000 4.008678 3.899561
#> 4 5.943374 4.515470 4.008678 0.000000 5.059321
#> 5 3.704322 2.310364 3.899561 5.059321 0.000000
Note: in the context of time series,
proxy
treats each row in a matrix as a series,
which can be confirmed by the fact that sample_data
above is a 5x10
matrix and the resulting cross-distance matrix is 5x5
.
The dtw package implements many variations of DTW,
and it also leverages proxy
.
You could calculate a DTW distance matrix with:
suppressPackageStartupMessages(library(dtw))
dtw_distmat <- proxy::dist(sample_data, method = "dtw",
upper = TRUE, diag = TRUE)
print(distance_matrix)
#> 1 2 3 4 5
#> 1 0.000000 2.636027 3.834764 5.943374 3.704322
#> 2 2.636027 0.000000 2.587398 4.515470 2.310364
#> 3 3.834764 2.587398 0.000000 4.008678 3.899561
#> 4 5.943374 4.515470 4.008678 0.000000 5.059321
#> 5 3.704322 2.310364 3.899561 5.059321 0.000000
One nice thing about proxy
is that it gives you the option to register custom functions.
You seem to be interested in the normalized version of DTW,
so you could do something like this:
ndtw <- function(x, y = NULL, ...) {
dtw::dtw(x, y, ..., distance.only = TRUE)$normalizedDistance
}
pr_DB$set_entry(
FUN = ndtw,
names = "ndtw",
loop = TRUE,
distance = TRUE
)
ndtw_distmat <- proxy::dist(sample_data, method = "ndtw",
upper = TRUE, diag = TRUE)
print(ndtw_distmat)
#> 1 2 3 4 5
#> 1 0.0000000 0.4046622 0.5075772 0.6789465 0.5290478
#> 2 0.4046622 0.0000000 0.3630849 0.4866252 0.3612722
#> 3 0.5075772 0.3630849 0.0000000 0.5678698 0.3303344
#> 4 0.6789465 0.4866252 0.5678698 0.0000000 0.5078112
#> 5 0.5290478 0.3612722 0.3303344 0.5078112 0.0000000
See the documentation of pr_DB
for more information.
The dtwclust package
(which I made)
implements a basic but faster version of DTW which can use multi-threading and also leverages proxy
:
suppressPackageStartupMessages(library(dtwclust))
dtw_basic_distmat <- proxy::dist(sample_data, method = "dtw_basic", normalize = TRUE)
print(dtw_basic_distmat)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.0000000 0.4046622 0.5075772 0.6789465 0.5290478
#> [2,] 0.4046622 0.0000000 0.3630849 0.4866252 0.3612722
#> [3,] 0.5075772 0.3630849 0.0000000 0.5678698 0.3303344
#> [4,] 0.6789465 0.4866252 0.5678698 0.0000000 0.5078112
#> [5,] 0.5290478 0.3612722 0.3303344 0.5078112 0.0000000
The dtw_basic
implementation only supports two step patterns and one window type,
but it is considerably faster:
suppressPackageStartupMessages(library(microbenchmark))
microbenchmark(
proxy::dist(sample_data, method = "dtw", window.type = "sakoechiba", window.size = 5L),
proxy::dist(sample_data, method = "dtw_basic", window.size = 5L)
)
Unit: microseconds
expr min lq mean
proxy::dist(sample_data, method = "dtw", window.type = "sakoechiba", window.size = 5L) 5279.124 5621.742 6070.069
proxy::dist(sample_data, method = "dtw_basic", window.size = 5L) 657.966 710.418 776.474
median uq max neval cld
5802.354 6348.199 10411.000 100 b
752.282 814.037 1161.626 100 a
Another multi-threaded implementation is included in the parallelDist package, although I haven't personally tested it.
A single multivariate series is commonly a matrix where time spans the rows and the multiple variables span the columns. DTW also works for them:
mv_series1 <- matrix(rnorm(15L), nrow = 5L, ncol = 3L)
mv_series2 <- matrix(rnorm(15L), nrow = 5L, ncol = 3L)
print(dtw_distance <- dtw_basic(mv_series1, mv_series2))
#> [1] 22.80421
The nice thing about proxy
is that it can calculate distances between objects contained in lists too,
so you can put several multivariate series in lists of matrices:
mv_series <- lapply(1L:5L, function(dummy) {
matrix(rnorm(15L), nrow = 5L, ncol = 3L)
})
mv_distmat_dtwclust <- proxy::dist(mv_series, method = "dtw_basic")
print(mv_distmat_dtwclust)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.00000 27.43599 32.14207 36.42211 31.19279
#> [2,] 27.43599 0.00000 20.88470 23.88436 29.73219
#> [3,] 32.14207 20.88470 0.00000 22.14376 29.99899
#> [4,] 36.42211 23.88436 22.14376 0.00000 28.81111
#> [5,] 31.19279 29.73219 29.99899 28.81111 0.00000
Regardless of what you choose,
you can probably use proxy
to get your result,
but since you haven't provided your whole data,
I can't give you a more specific example.
I presume that dtwclust::dtw_basic(control[, 1:4], lead[, 1:4], normalize = TRUE)
would give you the distance between one pair of series,
assuming you're treating each one as a multivariate series with 4 variables.