Question
I have a dataframe of ids and coordinates. I need to calculate the geographic distance between all my ids, drop the ones that are too far from each other, and then go on with my analysis.
I have 30k ids, which would generate a 30k x 30k matrix. Here is a sample:
latitude longitude id
-23.52472 -46.47785 917_62346
-23.62010 -46.69345 244_42975
-23.61636 -46.48148 302_75289
-23.53826 -46.46756 917_96304
-23.58266 -46.54495 302_84126
-23.47005 -46.70921 233_97098
-23.49235 -46.49342 917_62953
-23.52226 -46.72710 244_42245
-23.64853 -46.72237 635_90928
-23.49640 -46.61215 244_2662
x2 = structure(list(latitude = c(-23.5247247, -23.6200954, -23.6163624,
-23.5382557, -23.5826609, -23.4700519, -23.4923465, -23.5222581,
-23.6485288, -23.4964047), longitude = c(-46.4778499, -46.6934512,
-46.4814794, -46.4675563, -46.5449536, -46.7092093, -46.4934192,
-46.7270957, -46.7223717, -46.6121477), id = c("917_62346", "244_42975",
"302_75289", "917_96304", "302_84126", "233_97098", "917_62953",
"244_42245", "635_90928", "244_2662")), .Names = c("latitude",
"longitude", "id"), row.names = c(12041L, 18549L, 13641L, 28386L,
9380L, 6064L, 12724L, 21671L, 18939L, 3396L), class = "data.frame")
First I tried to go straight for it, using the geosphere package:
library(geosphere)
library(data.table)
d.matrix <- distm(cbind(x2$longitude, x2$latitude))
This does not work because of memory issues: Error: cannot allocate vector of size 15.4 Gb.
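As a back-of-envelope check, assuming 8-byte doubles, the dense result matrix alone is already a large share of the available RAM:
n <- 30000
n^2 * 8 / 2^30   # ~6.7 GiB for the 30k x 30k result,
                 # before counting any intermediate objects distm() builds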
My second attempt was to first generate all the pairwise combinations beforehand, then merge with the original data set to get the lats and lons, and then calculate the distances, such as
dis.long <- expand.grid(x2$id, x2$id)
dis.long <- merge(dis.long, x2, by.x = "Var1", by.y = "id")
dis.long <- merge(dis.long, x2, by.x = "Var2", by.y = "id")
setDT(dis.long)  # merge() returns a data.frame, so convert before using :=
dis.long[ , dist_km2 := distGeo(matrix(c(longitude.x, latitude.x), ncol = 2),
                                matrix(c(longitude.y, latitude.y), ncol = 2)) / 1000]
However, expand.grid runs out of memory. This is strange to me, since the resulting table would be 900 million rows by 2 columns, and I already work with data sets that are way larger (like 200 million x 50 matrices).
Another observation: I already tried using integer ids such as new_id = seq(1L, 30000L, 1L), to see whether integers would solve it, but I get the same memory issue when I try to expand.
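For reference, a rough count of the size of the expand.grid result alone, assuming 4-byte integer/factor codes for the id columns (8 bytes if each id cell is stored as a pointer or double):
n <- 30000
pairs <- n^2                  # 9e8 rows
pairs * 2 * 4 / 2^30          # ~6.7 GiB for two integer/factor id columns
pairs * 2 * 8 / 2^30          # ~13.4 GiB if each id cell takes 8 bytes
# any temporary copies made by expand.grid() or merge() come on top of this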
This is my current configuration, on a desktop with 16 GB of RAM:
> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] xlsx_0.5.7 xlsxjars_0.6.1 rJava_0.9-8 geosphere_1.5-5 sp_1.2-5 haven_1.0.0
[7] stringr_1.2.0 data.table_1.10.4
Can anybody give me an idea of how to calculate these distances? And why can't I generate this specific expand.grid while being able to construct bigger objects?
Answer 1:
You do not need to compare all-vs-all, which includes self-comparisons and both orderings of each pair (A-B as well as B-A); therefore you should try combn instead of expand.grid.
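A tiny illustration with three made-up ids of what gets dropped:
ids <- c("a", "b", "c")
expand.grid(ids, ids)   # 9 ordered pairs, including a-a, b-b, c-c and both a-b and b-a
t(combn(ids, 2))        # 3 unordered pairs: a-b, a-c, b-c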
Your data
x2 = structure(list(latitude = c(-23.5247247, -23.6200954, -23.6163624,
-23.5382557, -23.5826609, -23.4700519, -23.4923465, -23.5222581,
-23.6485288, -23.4964047), longitude = c(-46.4778499, -46.6934512,
-46.4814794, -46.4675563, -46.5449536, -46.7092093, -46.4934192,
-46.7270957, -46.7223717, -46.6121477), id = c("917_62346", "244_42975",
"302_75289", "917_96304", "302_84126", "233_97098", "917_62953",
"244_42245", "635_90928", "244_2662")), .Names = c("latitude",
"longitude", "id"), row.names = c(12041L, 18549L, 13641L, 28386L,
9380L, 6064L, 12724L, 21671L, 18939L, 3396L), class = "data.frame")
expand.grid
OP <- function(df) {
x3 = expand.grid(df$id, df$id)
Y <- merge(x3, df, by.x = "Var1", by.y = "id")
Y <- merge(Y, df, by.x = "Var2", by.y = "id")
return(Y)
}
vs combn
CP <- function(df) {
Did = as.data.frame(t(combn(df$id, 2)))
Z <- merge(Did, df, by.x = "V1", by.y = "id")
Z <- merge(Z, df, by.x = "V2", by.y = "id")
return(Z)
}
Comparison
dis.long <- OP(x2)
object.size(dis.long)
# 7320 bytes
new <- CP(x2)
object.size(new)
# 5016 bytes
Much larger example
num <- 5e2
bigx <- data.frame(latitude=rnorm(num)*-23, longitude=rnorm(num)*-46, id=1:num)
bigdl <- OP(bigx)
object.size(bigdl)
# 10001224 bytes
bignew <- CP(bigx)
object.size(bignew)
# 4991224 bytes
About half the size, since combn keeps only the n(n-1)/2 unordered pairs instead of all n^2 ordered pairs.
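To finish the workflow asked about in the question, the distances can then be computed on the unordered pairs and filtered; the 10 km cutoff below is only an illustrative value, not something given in the question:
library(geosphere)                        # for distGeo(), as loaded in the question
new$dist_km <- distGeo(cbind(new$longitude.x, new$latitude.x),
                       cbind(new$longitude.y, new$latitude.y)) / 1000
close_pairs <- new[new$dist_km <= 10, ]   # keep only pairs within 10 km (example threshold)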
Source: https://stackoverflow.com/questions/46004625/how-to-compute-big-geographic-distance-matrix