What is the best way to store distances with H2O?

Submitted by 魔方 西西 on 2019-12-11 04:43:10

Question


Suppose I have two data.frames and I want to calculate the Euclidean distance between every row of the first and every row of the second. My code is:

set.seed(121)
# Load library
library(h2o)
system.time({
  h2o.init()
  # Create the data frames and convert them to H2OFrame format
  df1 <- as.h2o(matrix(rnorm(7500 * 40), ncol = 40))
  df2 <- as.h2o(matrix(rnorm(1250 * 40), ncol = 40))
  # Create a matrix to record the distances:
  # one row per row of df1, one column per row of df2
  matrix1 <- as.h2o(matrix(0, nrow = 7500, ncol = 1250))
  # Loop over the rows of df2, filling one column of distances per iteration
  for (i in 1:nrow(df2)){
    matrix1[, i] <- h2o.sqrt(h2o.distance(df1, df2[i, ]))
  }
})

I'm sure there is a more efficient way to compute the distances and store them in a matrix.


Answer 1:


You don't need to calculate the distances inside a loop: H2O's distance function can efficiently calculate the distances for all rows at once. For two data frames with dimensions n x k and m x k, you can obtain the n x m distance matrix as follows:

distance_matrix <- h2o.distance(df1, df2, 'l2')

There is no need to take the square root, because h2o.distance() lets you specify which distance measure to use: "l1" (absolute distance, L1 norm), "l2" (Euclidean distance, L2 norm), "cosine" (cosine similarity), and "cosine_sq" (squared cosine similarity).
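
For reference, the four measures can be sketched in base R for a single pair of vectors. This only illustrates the standard formulas and is not H2O's implementation. The helper name pair_measures is hypothetical, and the sketch assumes that "cosine_sq" is simply the square of the cosine similarity.

```r
# Hypothetical helper: the four measures for one pair of numeric vectors.
# Illustrates the textbook formulas only; not taken from H2O's source.
pair_measures <- function(a, b) {
  cosine <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
  c(
    l1        = sum(abs(a - b)),        # absolute (Manhattan) distance
    l2        = sqrt(sum((a - b)^2)),   # Euclidean distance
    cosine    = cosine,                 # cosine similarity
    cosine_sq = cosine^2                # assumed: squared cosine similarity
  )
}

m <- pair_measures(c(3, 4), c(4, 3))
print(m)
```

Note that the "l2" entry already includes the square root, which is why wrapping the result in another square root would give the wrong value.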

Following your example, the code to calculate the Euclidean distance matrix will be:

library(h2o)
h2o.init()
df1 <- as.h2o(matrix(rnorm(7500 * 40), ncol = 40))
df2 <- as.h2o(matrix(rnorm(1250 * 40), ncol = 40))
distance_matrix <- h2o.distance(df1, df2, 'l2')

resulting in a matrix with dimensions 7500 rows x 1250 columns.
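
To sanity-check the shape and values of such a result on a small input, the same n x m Euclidean distance matrix can be computed in base R without an H2O cluster. This is a minimal sketch, not what H2O does internally. The function name pairwise_l2 is hypothetical, and it relies on the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b.

```r
# Hypothetical base-R helper: pairwise Euclidean distances between the rows
# of a (n x k) and the rows of b (m x k). Returns an n x m matrix.
pairwise_l2 <- function(a, b) {
  sq <- outer(rowSums(a^2), rowSums(b^2), "+") - 2 * tcrossprod(a, b)
  sqrt(pmax(sq, 0))  # clamp tiny negative values caused by rounding
}

set.seed(121)
a <- matrix(rnorm(6 * 4), ncol = 4)  # 6 rows, 4 columns
b <- matrix(rnorm(3 * 4), ncol = 4)  # 3 rows, 4 columns
d <- pairwise_l2(a, b)

print(dim(d))
# Entry [2, 3] should match the distance between row 2 of a and row 3 of b
print(all.equal(d[2, 3], sqrt(sum((a[2, ] - b[3, ])^2))))
```

On small samples, comparing this matrix with as.matrix() applied to the H2O result is one way to confirm that rows correspond to df1 and columns to df2.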



Source: https://stackoverflow.com/questions/45814469/what-is-the-best-way-to-store-distances-with-h2o
