Computing pairwise Hamming distance between all rows of two integer matrices/data frames

挽巷 2021-01-06 14:55

I have two data frames, df1 with reference data and df2 with new data. For each row in df2, I need to find the best (and the second be

3 Answers
  • 2021-01-06 15:17

    Fast computation of hamming distance between two integer vectors of equal length

    As I said in my comment, we can do:

    hmd0 <- function(x,y) sum(as.logical(xor(intToBits(x),intToBits(y))))
    

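    to compute the hamming distance between two integer vectors x and y of equal length. The distance is counted in bits: intToBits() expands each integer to its 32-bit representation, and the differing bits are summed over all elements. This only uses base R, yet is more efficient than e1071::hamming.distance, because it is vectorized!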

    For the example x and y in your post, this gives 25. (My other answer will show what we should do, if we want pairwise hamming distance.)
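
    As a small worked example of what hmd0 counts (the input values below are made up for illustration and are not taken from the question):

```r
hmd0 <- function(x, y) sum(as.logical(xor(intToBits(x), intToBits(y))))

# 5 is ...0101 and 3 is ...0011 in binary: they differ in two bit positions
d_single <- hmd0(5L, 3L)

# For vectors, the bit differences of all elements are added up:
# 5 vs 3 differ in 2 bits, 8 (...1000) vs 0 differ in 1 bit
d_vector <- hmd0(c(5L, 8L), c(3L, 0L))

print(c(d_single, d_vector))
```

    So the result is a single number for the whole pair of vectors, not one number per element.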


    Fast hamming distance between a matrix and a vector

    If we want to compute the hamming distance between a single y and multiple xs, i.e., the hamming distance between a vector and a matrix, we can use the following function.

    hmd <- function(x,y) {
      rawx <- intToBits(x)
      rawy <- intToBits(y)
      nx <- length(rawx)
      ny <- length(rawy)
      if (nx == ny) {
        ## quick return
        return (sum(as.logical(xor(rawx,rawy))))
        } else if (nx < ny) {
        ## pivoting
        tmp <- rawx; rawx <- rawy; rawy <- tmp
        tmp <- nx; nx <- ny; ny <- tmp
        }
      if (nx %% ny) stop("unconformable length!") else {
        nc <- nx / ny  ## number of cycles
        return(unname(tapply(as.logical(xor(rawx,rawy)), rep(1:nc, each=ny), sum)))
        }
      }
    

    Note that:

    1. hmd works column-wise: each column of the matrix x is compared with the vector y. R stores matrices in column-major order, so this access pattern is CPU cache friendly. If we want a row-wise computation, we should transpose the matrix first;
    2. there is no obvious loop here; instead, we use tapply().
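
    The same column-wise idea can also be sketched with a reshape plus colSums() instead of tapply(). The helper name hmd_cols and the test values are hypothetical and only serve as an illustration; this is not a replacement for hmd above.

```r
# Hypothetical helper (not part of the answer): compares every column of the
# integer matrix x with the integer vector y, where length(y) == nrow(x).
hmd_cols <- function(x, y) {
  stopifnot(length(x) %% length(y) == 0)
  # repeat y explicitly so that both bit vectors have the same length
  y_full <- rep(y, length.out = length(x))
  diff_bits <- as.logical(xor(intToBits(x), intToBits(y_full)))
  # each column of x occupies 32 * length(y) consecutive bits
  colSums(matrix(diff_bits, nrow = 32L * length(y)))
}

m <- matrix(c(1L, 2L,    # column 1: identical to y
              3L, 2L,    # column 2: 3 vs 1 differ in one bit
              0L, 0L),   # column 3: 0 vs 1 and 0 vs 2 differ in one bit each
            nrow = 2)
y <- c(1L, 2L)
res_cols <- hmd_cols(m, y)
print(res_cols)
```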

    Fast hamming distance computation between two matrices/data frames

    This is what you want. The following function foo takes two data frames or matrices df1 and df2, and computes the hamming distance between every row of df1 and each row of df2. The argument p is an integer giving the number of results to retain per row of df2. For example, p = 3 keeps the 3 smallest distances together with their row ids in df1.

    foo <- function(df1, df2, p) {
      ## check p: we pick the p nearest rows of df1, so p cannot exceed nrow(df1)
      if (p > nrow(df1)) p <- nrow(df1)
      ## transpose for CPU cache friendly code
      xt <- t(as.matrix(df1))
      yt <- t(as.matrix(df2))
      ## after transpose, we compute hamming distance column by column
      ## a for loop is decent; no performance gain from apply family
      n <- ncol(yt)
      id <- integer(n * p)
      d <- numeric(n * p)
      k <- 1:p
      for (i in 1:n) {
        distance <- hmd(xt, yt[,i])
        minp <- order(distance)[1:p]
        id[k] <- minp
        d[k] <- distance[minp]
        k <- k + p
        }
      ## recode "id" and "d" into data frame and return
      id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
      colnames(id) <- paste0("min.", 1:p)
      d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
      colnames(d) <- paste0("mindist.", 1:p)
      list(id = id, d = d)
      }
    

    Note that:

    1. transposition is done at the beginning, for the reasons given above;
    2. a for loop is used here. This is actually efficient, because a considerable amount of computation is done in each iteration. It is also more elegant than using the *apply family, since we ask for multiple outputs (the row ids id and the distances d).

    Experiment

    This part uses a small dataset to test/demonstrate our functions.

    Some toy data:

    set.seed(0)
    df1 <- as.data.frame(matrix(sample(1:10), ncol = 2))  ## 5 rows 2 cols
    df2 <- as.data.frame(matrix(sample(1:6), ncol = 2))  ## 3 rows 2 cols
    

    Test hmd first (needs transposition):

    hmd(t(as.matrix(df1)), df2[1, ])  ## df1 & first row of df2
    # [1] 2 4 6 2 4
    

    Test foo:

    foo(df1, df2, p = 2)
    
    # $id
    #   min.1 min.2
    # 1     1     4
    # 2     2     3
    # 3     5     2
    
    # $d
    #   mindist.1 mindist.2
    # 1         2         2
    # 2         1         3
    # 3         1         3
    

    If you want to append some columns to df2, you know what to do, right?
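
    One possible way to do that is cbind(). In the sketch below, res is a stand-in for the list returned by foo (its values are copied from the toy output), and the contents of df2 are hypothetical because the real values depend on the random seed.

```r
# Stand-in for the result of foo(df1, df2, p = 2)
res <- list(
  id = data.frame(min.1 = c(1L, 2L, 5L), min.2 = c(4L, 3L, 2L)),
  d  = data.frame(mindist.1 = c(2, 1, 1), mindist.2 = c(2, 3, 3))
)

# Hypothetical 3-row df2 (illustrative values only)
df2 <- data.frame(V1 = c(4L, 1L, 6L), V2 = c(2L, 5L, 3L))

# Row i of res$id and res$d belongs to row i of df2, so a plain cbind works
df2_out <- cbind(df2, res$id, res$d)
print(df2_out)
```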

  • 2021-01-06 15:42

    Please don't be surprised that I am posting another answer. This part covers something related. It is not what the OP asked for, but it may help other readers.


    General hamming distance computation

    In the previous answer, I started from a function hmd0 that computes the hamming distance between two integer vectors of the same length. This means that if we have 2 integer vectors:

    set.seed(0)
    x <- sample(1:100, 6)
    y <- sample(1:100, 6)
    

    we will end up with a scalar:

    hmd0(x,y)
    # 13
    

    What if we want the hamming distance for each pair of corresponding elements of the two vectors?

    In fact, a simple modification to our function hmd will do:

    hamming.distance <- function(x, y, pairwise = TRUE) {
      nx <- length(x)
      ny <- length(y)
      rawx <- intToBits(x)
      rawy <- intToBits(y)
      if (nx == 1 && ny == 1) return(sum(as.logical(xor(intToBits(x),intToBits(y)))))
      if (nx < ny) {
        ## pivoting
        tmp <- rawx; rawx <- rawy; rawy <- tmp
        tmp <- nx; nx <- ny; ny <- tmp
        }
      if (nx %% ny) stop("unconformable length!") else {
        bits <- length(intToBits(0)) ## intToBits() returns 32 bits per integer
        ## one group per element of the longer vector (nx after pivoting)
        result <- unname(tapply(as.logical(xor(rawx,rawy)), rep(1:nx, each = bits), sum))
        }
      if (pairwise) result else sum(result)
      }
    

    Now

    hamming.distance(x, y, pairwise = TRUE)
    # [1] 0 3 3 2 5 0
    hamming.distance(x, y, pairwise = FALSE)
    # [1] 13
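
    For comparison, the element-wise distance can also be sketched with base R's bitwXor() followed by a bit count. The helper name popcount and the input values are hypothetical, and the sketch assumes non-missing integer input of equal length.

```r
# Hypothetical helper: number of set bits in each element of an integer vector
popcount <- function(v) colSums(matrix(as.integer(intToBits(v)), nrow = 32L))

x <- c(5L, 8L, 7L)
y <- c(3L, 8L, 0L)

# XOR marks the differing bits; counting them gives the distance per element
per_element <- popcount(bitwXor(x, y))
total <- sum(per_element)

print(per_element)
print(total)
```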
    

    Hamming distance matrix

    If we want to compute the hamming distance matrix, for example,

    set.seed(1)
    x <- sample(1:100, 5)
    y <- sample(1:100, 7)
    

    The distance matrix between x and y is:

    outer(x, y, hamming.distance)  ## relies on the default pairwise = TRUE
    
    #      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
    # [1,]    2    3    4    3    4    4    2
    # [2,]    7    6    3    4    3    3    3
    # [3,]    4    5    4    3    6    4    2
    # [4,]    2    3    2    5    6    4    2
    # [5,]    4    3    4    3    2    0    2
    

    We can also do:

    outer(x, x, hamming.distance)
    
    #     [,1] [,2] [,3] [,4] [,5]
    # [1,]    0    5    2    2    4
    # [2,]    5    0    3    5    3
    # [3,]    2    3    0    2    4
    # [4,]    2    5    2    0    4
    # [5,]    4    3    4    4    0
    

    In the latter situation, we end up with a symmetric matrix with 0 on the diagonal. Using outer is wasteful here, since it computes both triangles of a symmetric matrix, but it is still more efficient than writing R loops. Since our hamming.distance is written in R code, I would stay with outer. In my answer to this question, I demonstrate the idea of using compiled code. This of course requires writing a C version of hamming.distance, which I will not show here.
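
    If the duplicated work matters, one option that stays in plain R is to evaluate only the upper triangle and mirror it. This is a rough sketch under the assumption of non-missing integer input; the helper ham is a hypothetical vectorized distance built on bitwXor(), not the hamming.distance function above.

```r
# Hypothetical vectorized element-wise distance
ham <- function(a, b) {
  colSums(matrix(as.integer(intToBits(bitwXor(a, b))), nrow = 32L))
}

x <- c(5L, 3L, 0L, 7L)
n <- length(x)
D <- matrix(0, n, n)

# index pairs (row, col) with row < col
ut <- which(upper.tri(D), arr.ind = TRUE)
D[ut] <- ham(x[ut[, 1]], x[ut[, 2]])

# mirror into the lower triangle; the diagonal stays 0
D <- D + t(D)
print(D)
```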

  • 2021-01-06 15:42

    Here's an alternative solution that uses only base R and should be very fast, especially when your df1 and df2 have many rows. The main reason is that it does not use any R-level looping (for loops, while loops, or *apply functions) to calculate the Hamming distances. Instead, it computes them with matrix multiplication, which in R is much faster than any approach based on R-level looping. Also note that using an *apply function will not necessarily make your code any faster than using a for loop.

    Two other efficiency-related features of this approach are:

    1. it uses partial sorting to find the best two matches for each row in df2;
    2. it stores the entire bitwise representation of df1 in one matrix (same for df2), and builds it in one single step, without using any R-level loops.

    The function that does all the work:

    # INPUT:       
    # X corresponds to your entire df1, but is a matrix
    # Y corresponds to your entire df2, but is a matrix
    # OUTPUT:
    # Matrix with four columns corresponding to the values 
    # that you specified in your question
    fun <- function(X, Y) {
    
      # Record the dimensions before X and Y are overwritten
      nx <- nrow(X)
      ny <- nrow(Y)
      nc <- ncol(X)
      stopifnot(ncol(Y) == nc)
    
      # Convert integers to bits 
      X <- intToBits(t(X))
      # Reshape into matrix: one column of bits per row of the input
      dim(X) <- c(nc * 32, nx)
    
      # Convert integers to bits
      Y <- intToBits(t(Y))
      # Reshape into matrix
      dim(Y) <- c(nc * 32, ny)
    
      # Calculate pairwise hamming distances using matrix 
      # multiplication. 
      # Columns of H index into Y; rows index into X.
      # The code for the hamming() function was retrieved
      # from this page (it must be defined before calling fun):
      # https://johanndejong.wordpress.com/2015/10/02/faster-hamming-distance-in-r-2/
      H <- hamming(X, Y)
    
      # Now, for each row in Y, find the two best matches 
      # in X. In other words: for each column in H, find 
      # the two smallest values and their row indices.
      t(apply(H, 2, function(h) {
        mindists <- sort(h, partial = 1:2)
        ind1 <- which(h == mindists[1])[1]
        # Skip ind1 so that a tie does not report the same row twice
        ind2 <- setdiff(which(h == mindists[2]), ind1)[1]
        c(
          ind1 = ind1,
          val1 = mindists[1],
          ind2 = ind2,
          val2 = mindists[2]
        )
      }))
    }
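
    The hamming() function itself is not reproduced in this answer. To illustrate the matrix multiplication idea only, here is a minimal sketch for bit matrices. The name hamming_bits and the test values are hypothetical, and this is not the code from the linked page.

```r
# Hypothetical stand-in, NOT the hamming() from the linked page.
# X: (bits x n) matrix of 0/1 values, Y: (bits x m) matrix of 0/1 values.
# Returns an n x m matrix; rows index into X, columns index into Y.
hamming_bits <- function(X, Y) {
  Xn <- matrix(as.numeric(X), nrow = nrow(X))
  Yn <- matrix(as.numeric(Y), nrow = nrow(Y))
  # differing bits = (1 in X and 0 in Y) + (0 in X and 1 in Y)
  crossprod(Xn, 1 - Yn) + crossprod(1 - Xn, Yn)
}

A <- matrix(c(5L, 3L,
              0L, 7L), nrow = 2, byrow = TRUE)
B <- matrix(c(5L, 3L,
              1L, 7L,
              0L, 0L), nrow = 3, byrow = TRUE)

# Same reshaping as in fun(): one column of bits per input row
bitsA <- intToBits(t(A)); dim(bitsA) <- c(ncol(A) * 32L, nrow(A))
bitsB <- intToBits(t(B)); dim(bitsB) <- c(ncol(B) * 32L, nrow(B))

H <- hamming_bits(bitsA, bitsB)
print(H)
```

    The two crossprod() calls replace all per-pair loops, at the cost of holding the bit matrices as doubles in memory.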
    

    To call the function on some random data:

    # Generate some random test data with no. of columns 
    # corresponding to your data
    nrows <- 1000
    ncols <- 26 
    
    # X corresponds to your df1
    X <- matrix(
      sample(1e6, nrows * ncols, replace = TRUE), 
      nrow = nrows, 
      ncol = ncols
    )
    
    # Y corresponds to your df2
    Y <- matrix(
      sample(1e6, nrows * ncols, replace = TRUE), 
      nrow = nrows, 
      ncol = ncols
    )
    
    res <- fun(X, Y)
    

    The above example with 1000 rows in both X (df1) and Y (df2) took about 1.1 - 1.2 seconds to run on my laptop.
