Question
Suppose I have two data.frames and I want to calculate the Euclidean distance between every row of one and every row of the other. My code is:
set.seed(121)
# Load library
library(h2o)
system.time({
h2o.init()
# Create the df and convert to h2o frame format
df1 <- as.h2o(matrix(rnorm(7500 * 40), ncol = 40))
df2 <- as.h2o(matrix(rnorm(1250 * 40), ncol = 40))
# Create a matrix to hold the distances: one row per df1 row, one column per df2 row
matrix1 <- as.h2o(matrix(0, nrow = 7500, ncol = 1250))
# Loop over the rows of df2 and compute the distances to every row of df1
for (i in 1:nrow(df2)){
matrix1[, i] <- h2o.sqrt(h2o.distance(df1, df2[i, ]))
}
})
I'm sure there is a more efficient way to store the results in a matrix.
Answer 1:
You don't need to calculate the distances inside a loop; H2O's distance function can efficiently calculate distances for all rows at once. For two data frames with dimensions n x k and m x k, you can compute the n x m distance matrix as follows:
distance_matrix <- h2o.distance(df1, df2, 'l2')
There is no need to take the square root, since the h2o.distance() function lets you specify which distance measure to use: "l1" - Absolute distance (L1 norm), "l2" - Euclidean distance (L2 norm), "cosine" - Cosine similarity, and "cosine_sq" - Squared Cosine similarity.
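For reference, the four measure names correspond to the following textbook definitions. This is a sketch under assumptions: x_i denotes row i of the n x k frame, y_j denotes row j of the m x k frame, and c indexes the k columns. The notation is introduced here for illustration, and H2O's exact conventions (for example, how missing values are handled) are not confirmed by the original answer.

```latex
\begin{aligned}
d_{\text{l1}}(x_i, y_j) &= \sum_{c=1}^{k} \lvert x_{ic} - y_{jc} \rvert \\
d_{\text{l2}}(x_i, y_j) &= \sqrt{\sum_{c=1}^{k} (x_{ic} - y_{jc})^2} \\
s_{\text{cosine}}(x_i, y_j) &= \frac{\sum_{c=1}^{k} x_{ic}\, y_{jc}}{\lVert x_i \rVert_2 \, \lVert y_j \rVert_2} \\
s_{\text{cosine\_sq}}(x_i, y_j) &= s_{\text{cosine}}(x_i, y_j)^2
\end{aligned}
```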
Following your example, the code to calculate the Euclidean distance matrix will be:
library(h2o)
h2o.init()
df1 <- as.h2o(matrix(rnorm(7500 * 40), ncol = 40))
df2 <- as.h2o(matrix(rnorm(1250 * 40), ncol = 40))
distance_matrix <- h2o.distance(df1, df2, 'l2')
This results in a matrix with 7500 rows and 1250 columns.
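To sanity-check the shape and values of such a distance matrix without starting an H2O cluster, a small base R reference can help. The following is a minimal sketch, not part of the h2o package: euclid_cross() is a hypothetical helper name, and the 6 x 40 and 4 x 40 matrices are made-up stand-ins for df1 and df2.

```r
# Reference Euclidean distance matrix in base R (no H2O required).
# euclid_cross() is a hypothetical helper, not an h2o function.
euclid_cross <- function(a, b) {
  stopifnot(ncol(a) == ncol(b))
  # ||a_i - b_j||^2 = ||a_i||^2 + ||b_j||^2 - 2 * (a_i . b_j)
  sq <- outer(rowSums(a^2), rowSums(b^2), "+") - 2 * tcrossprod(a, b)
  # Clamp tiny negative values caused by floating-point rounding
  sqrt(pmax(sq, 0))
}

set.seed(121)
a <- matrix(rnorm(6 * 40), ncol = 40)  # stand-in for df1 (n = 6)
b <- matrix(rnorm(4 * 40), ncol = 40)  # stand-in for df2 (m = 4)
d <- euclid_cross(a, b)
dim(d)  # one row per row of a, one column per row of b
```

Entry d[i, j] is the distance between row i of a and row j of b. If the "l2" measure is the plain Euclidean distance, as the answer describes, then as.matrix(h2o.distance(as.h2o(a), as.h2o(b), "l2")) should agree with d up to floating-point error; that comparison is a suggestion and was not run here.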
Source: https://stackoverflow.com/questions/45814469/what-is-the-best-way-to-store-distances-with-h2o