Calculating a distance matrix by dtw

后端 未结 2 543
攒了一身酷
攒了一身酷 2020-12-12 06:02

I have two matrices of normalized read counts for control and treatment in a time series day1 to day26. I want to calculate distance matrix by Dynamic Time Wrapping afterwar

相关标签:
2条回答
  • 2020-12-12 06:39

    If your question is "why am I getting this error?" the answer is that you're trying to subset a matrix, which is a two dimensional array, according to a 3rd dimension.

    see:

    dim(lead)
    # [1] 26 2603
    lead[,,6.418423] # yes, that's the value j has the first time through the loop
    # This will reproduce your error
    lead[,,1]
    # This will also reproduce your error
    

    Hopefully you can see now that you have a few problems:

    1. You're trying to subset a matrix according to a 3rd dimension
    2. Your i and j values are the values in control and lead respectively. You can use them as their values, or you can generate the index, e.g., for(i in seq_along(control) if you're planning to use it for something other than getting that same value out.
    3. Taking it to the next step, it's unclear what you want to pass to the dist function. dist takes a single matrix and computes the distance between its rows. You seem to be trying to pass it two values from two different matrices, or perhaps two subsets of two different matrices. It looks like you might need to go back and look at the examples in the documentation for xtr
    0 讨论(0)
  • 2020-12-12 07:01

    There have already been questions similar to yours, but the answers haven't been too detailed. Here's a breakdown of what you need to know, in the specific case of R.

    Calculating cross-distance matrices

    The proxy package is made specifically for the calculation of cross-distance matrices. You should check its vignette to know which measures are already implemented by it. An example of its use:

    set.seed(1L)
    sample_data <- matrix(rnorm(50L), nrow = 5L, ncol = 10L)
    
    suppressPackageStartupMessages(library(proxy))
    distance_matrix <- proxy::dist(sample_data, method = "euclidean", 
                                   upper = TRUE, diag = TRUE)
    print(distance_matrix)
    #>          1        2        3        4        5
    #> 1 0.000000 2.636027 3.834764 5.943374 3.704322
    #> 2 2.636027 0.000000 2.587398 4.515470 2.310364
    #> 3 3.834764 2.587398 0.000000 4.008678 3.899561
    #> 4 5.943374 4.515470 4.008678 0.000000 5.059321
    #> 5 3.704322 2.310364 3.899561 5.059321 0.000000
    

    Note: in the context of time series, proxy treats each row in a matrix as a series, which can be confirmed by the fact that sample_data above is a 5x10 matrix and the resulting cross-distance matrix is 5x5.

    Using the DTW distance

    The dtw package implements many variations of DTW, and it also leverages proxy. You could calculate a DTW distance matrix with:

    suppressPackageStartupMessages(library(dtw))
    dtw_distmat <- proxy::dist(sample_data, method = "dtw", 
                               upper = TRUE, diag = TRUE)
    print(distance_matrix)
    #>          1        2        3        4        5
    #> 1 0.000000 2.636027 3.834764 5.943374 3.704322
    #> 2 2.636027 0.000000 2.587398 4.515470 2.310364
    #> 3 3.834764 2.587398 0.000000 4.008678 3.899561
    #> 4 5.943374 4.515470 4.008678 0.000000 5.059321
    #> 5 3.704322 2.310364 3.899561 5.059321 0.000000
    

    Using custom distances

    One nice thing about proxy is that it gives you the option to register custom functions. You seem to be interested in the normalized version of DTW, so you could do something like this:

    ndtw <- function(x, y = NULL, ...) {
        dtw::dtw(x, y, ..., distance.only = TRUE)$normalizedDistance
    }
    
    pr_DB$set_entry(
      FUN = ndtw,
      names = "ndtw",
      loop = TRUE,
      distance = TRUE
    )
    
    ndtw_distmat <- proxy::dist(sample_data, method = "ndtw",
                                upper = TRUE, diag = TRUE)
    print(ndtw_distmat)
    #>           1         2         3         4         5
    #> 1 0.0000000 0.4046622 0.5075772 0.6789465 0.5290478
    #> 2 0.4046622 0.0000000 0.3630849 0.4866252 0.3612722
    #> 3 0.5075772 0.3630849 0.0000000 0.5678698 0.3303344
    #> 4 0.6789465 0.4866252 0.5678698 0.0000000 0.5078112
    #> 5 0.5290478 0.3612722 0.3303344 0.5078112 0.0000000
    

    See the documentation of pr_DB for more information.

    Other DTW implementations

    The dtwclust package (which I made) implements a basic but faster version of DTW which can use multi-threading and also leverages proxy:

    suppressPackageStartupMessages(library(dtwclust))
    dtw_basic_distmat <- proxy::dist(sample_data, method = "dtw_basic", normalize = TRUE)
    print(dtw_basic_distmat)
    #>      [,1]      [,2]      [,3]      [,4]      [,5]     
    #> [1,] 0.0000000 0.4046622 0.5075772 0.6789465 0.5290478
    #> [2,] 0.4046622 0.0000000 0.3630849 0.4866252 0.3612722
    #> [3,] 0.5075772 0.3630849 0.0000000 0.5678698 0.3303344
    #> [4,] 0.6789465 0.4866252 0.5678698 0.0000000 0.5078112
    #> [5,] 0.5290478 0.3612722 0.3303344 0.5078112 0.0000000
    

    The dtw_basic implementation only supports two step patterns and one window type, but it is considerably faster:

    suppressPackageStartupMessages(library(microbenchmark))
    microbenchmark(
      proxy::dist(sample_data, method = "dtw", window.type = "sakoechiba", window.size = 5L),
      proxy::dist(sample_data, method = "dtw_basic", window.size = 5L)
    )
    
    Unit: microseconds
                                                                                            expr      min       lq     mean
     proxy::dist(sample_data, method = "dtw", window.type = "sakoechiba",      window.size = 5L) 5279.124 5621.742 6070.069
                                proxy::dist(sample_data, method = "dtw_basic", window.size = 5L)  657.966  710.418  776.474
       median       uq       max neval cld
     5802.354 6348.199 10411.000   100   b
      752.282  814.037  1161.626   100  a
    

    Another multi-threaded implementation is included in the parallelDist package, although I haven't personally tested it.

    Multivariate or multi-dimensional time series

    A single multivariate series is commonly a matrix where time spans the rows and the multiple variables span the columns. DTW also works for them:

    mv_series1 <- matrix(rnorm(15L), nrow = 5L, ncol = 3L)
    mv_series2 <- matrix(rnorm(15L), nrow = 5L, ncol = 3L)
    print(dtw_distance <- dtw_basic(mv_series1, mv_series2))
    #> [1] 22.80421
    

    The nice thing about proxy is that it can calculate distances between objects contained in lists too, so you can put several multivariate series in lists of matrices:

    mv_series <- lapply(1L:5L, function(dummy) {
      matrix(rnorm(15L), nrow = 5L, ncol = 3L)
    })
    
    mv_distmat_dtwclust <- proxy::dist(mv_series, method = "dtw_basic")
    print(mv_distmat_dtwclust)
    #>      [,1]     [,2]     [,3]     [,4]     [,5]    
    #> [1,]  0.00000 27.43599 32.14207 36.42211 31.19279
    #> [2,] 27.43599  0.00000 20.88470 23.88436 29.73219
    #> [3,] 32.14207 20.88470  0.00000 22.14376 29.99899
    #> [4,] 36.42211 23.88436 22.14376  0.00000 28.81111
    #> [5,] 31.19279 29.73219 29.99899 28.81111  0.00000
    

    Your case

    Regardless of what you choose, you can probably use proxy to get your result, but since you haven't provided your whole data, I can't give you a more specific example. I presume that dtwclust::dtw_basic(control[, 1:4], lead[, 1:4], normalize = TRUE) would give you the distance between one pair of series, assuming you're treating each one as a multivariate series with 4 variables.

    0 讨论(0)
提交回复
热议问题